Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/03 18:03:11 UTC

[GitHub] [druid] techdocsmith opened a new pull request #11541: Ingestion

techdocsmith opened a new pull request #11541:
URL: https://github.com/apache/druid/pull/11541


   Refactor ingestion into smaller topics. Clean up some style here and there.
   
   ### Description
   
   This PR breaks down the ingestion landing page topic into smaller topics. Some benefits include:
   - Ingestion concepts like the Data Model, Rollup, and Partitioning now have smaller, more focused topics.
   - The Ingestion spec reference stands alone as a reference topic.
   - Should make it easier to add conceptual or example material for a specific concept.
   
   There may be some trade-off in topic maintenance. For example, if there's an enhancement to rollup, we may need to edit both `rollup.md` and `ingestion-spec.md`.
   
   This PR has:
   - [x] been self-reviewed.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683708020



##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
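-
-For example, a minimal `granularitySpec` that disables rollup could look like the following sketch (other fields omitted for brevity):
-
-```
-"granularitySpec": {
-  "queryGranularity": "none",
-  "rollup": false
-}
-```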
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` types may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
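-
-For example, with native batch ingestion you can request perfect rollup by setting `forceGuaranteedRollup` to `true` in the [`tuningConfig`](native-batch.md#tuningconfig), together with a hash-based `partitionsSpec`. A minimal sketch (the shard count is illustrative):
-
-```
-"tuningConfig": {
-  "type": "index_parallel",
-  "forceGuaranteedRollup": true,
-  "partitionsSpec": {
-    "type": "hashed",
-    "numShards": 4
-  }
-}
-```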
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
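-
-For illustration, if `countryName` were a natural partitioning dimension (a hypothetical field), listing it first in the `dimensionsSpec` tells Druid to sort rows within each segment by that column:
-
-```
-"dimensionsSpec": {
-  "dimensions": [
-    "countryName",
-    "page",
-    "language"
-  ]
-}
-```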
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, if you are doing initial ingestion through a less-flexible method (like
-Kafka), then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consist of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
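-
-For example, a `timestampSpec` that reads a hypothetical `ts` field using a Joda format string, and substitutes a fixed timestamp for records that lack one, could look like:
-
-```
-"timestampSpec": {
-  "column": "ts",
-  "format": "yyyy-MM-dd HH:mm:ss",
-  "missingValue": "2010-01-01T00:00:00.000Z"
-}
-```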
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is. | `sorted_array` |
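-
-For example, a dimension object for a hypothetical multi-value `tags` field that deduplicates its values and keeps a bitmap index could look like:
-
-```
-{
-  "type": "string",
-  "name": "tags",
-  "multiValueHandling": "sorted_set",
-  "createBitmapIndex": true
-}
-```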
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
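-
-For example, a schemaless `dimensionsSpec` that lets Druid discover all eligible fields while excluding a hypothetical `sessionId` field could look like:
-
-```
-"dimensionsSpec": {
-  "dimensions": [],
-  "dimensionExclusions": ["sessionId"]
-}
-```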
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to or finer than `segmentGranularity`. This is the finest granularity that you can query at and still receive sensible results, but you can still query at anything coarser than this granularity. For example, a value of `minute` means that records are stored at minutely granularity and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc.).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` is still applied when `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Rows are rolled up if they have exactly the same dimension values and timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
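-
-For example, a transform that shadows a hypothetical `country` field to upper-case it in place, and another that replaces `__time` by shifting it forward one hour (3,600,000 milliseconds), could look like:
-
-```
-"transforms": [
-  { "type": "expression", "name": "country", "expression": "upper(country)" },
-  { "type": "expression", "name": "__time", "expression": "__time + 3600000" }
-]
-```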
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
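-
-For example, because the `transforms` run first, the `filter` can select on the output of a transform. A sketch that reuses the `countryUpper` transform from the earlier example:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "countryUpper",
-    "value": "SAN SERRIFFE"
-  }
-}
-```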
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec includes the following two components in addition to the ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In the legacy `dataSchema`, the `parser` is located in `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property applies to all
-[ingestion methods](#ingestion-methods) except for Hadoop ingestion. Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` is reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` is reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` available until the next persist decreases, and the task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true excludes the bytes of these overhead objects from the maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using an offset or lookup table depending on column cardinality, and stores them with variable sizes. `longs` stores the values as-is, using 8 bytes each.|`longs`|
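-
-For example, an `indexSpec` that uses roaring bitmaps, stores metric columns without compression (`none`), and lets Druid pick the long encoding per column could look like:
-
-```
-"indexSpec": {
-  "bitmap": { "type": "roaring" },
-  "dimensionCompression": "lz4",
-  "metricCompression": "none",
-  "longEncoding": "auto"
-}
-```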
-
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |

Review comment:
       I wonder if it is better to say here: "perfect or best effort depending on batch ingestion method & other setting -- see rollup for more info"






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683699109



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN; MiddleManager or Indexer processes still start and monitor the Hadoop jobs.
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.

Review comment:
       We already said in the first paragraph that Druid creates segments. Now we are saying that they get stored in a special place. I would rephrase this as 
   
   "Segments created by the ingestion process get stored in [deep storage...] which in turn are loaded in Historical nodes by Historical processes in order to respond to Historical queries. See the [Storage design..].." 
   
   At some point the distinction has to be made between queries served by historical processes and those served by real time (i.e. middle manager/indexer) processes. BTW the latter only happens for streaming ingestion.






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683704425



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN; MiddleManager or Indexer processes still start and monitor the Hadoop jobs.
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic and the following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.
 
 ## Ingestion methods
 
-The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose

Review comment:
       Below in the table, we introduce "Supervisor type" with no explanation about what a supervisor is... can we link that to where we explain what a supervisor is?






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683803978



##########
File path: docs/ingestion/data-formats.md
##########
@@ -128,6 +128,8 @@ The CSV `inputFormat` has the following components:
 
 ### TSV (Delimited)
 
+The `inputFormat` to load data of a delimited format. An example is:

Review comment:
       Ok. This was fun 😬 Cleaned it up for the most part, PTAL






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682129212



##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can limit the effectiveness of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will then set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain even while you continuously add new data from a stream.

Review comment:
       ```suggestion
   Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain time threshold while you continuously add new data from a stream.
   ```






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683699109



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.

Review comment:
       We already said in the first paragraph that Druid creates segments. Now we are saying that they get stored in a special phase. I would rephrase this as 
   
   "Segments created by the ingestion process get stored in [deep storage...] which in turn are loaded in Historical nodes by Historical processes in order to respond to Historical queries" 
   
   At some point the distinction has to be made between queries served by historical processes and those served by real time (i.e. middle manager/indexer) processes. BTW the latter only happens for streaming ingestion.






[GitHub] [druid] vtlim commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
vtlim commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682074888



##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments have been generated and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.

Review comment:
       ```suggestion
   After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
   ```

##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes start and monitor Hadoop jobs. 

Review comment:
       ```suggestion
   Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.

Review comment:
       ```suggestion
   Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.

Review comment:
       So the user can use _either_ `timestampSpec` or `granularitySpec` as a primary timestamp but not both?
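
For context, the two specs are separate, complementary sections of the `dataSchema`: `timestampSpec` maps an input field to `__time`, while `granularitySpec` controls time-based bucketing of that value. A minimal sketch, assuming a hypothetical input field named `ts`:

```json
{
  "timestampSpec": {
    "column": "ts",
    "format": "iso"
  },
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none"
  }
}
```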

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid have matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio

Review comment:
       ```suggestion
       - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio.
   ```
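
For reference, the `rollup` flag discussed above lives in the `granularitySpec`; a minimal sketch with illustrative values:

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "day",
    "queryGranularity": "hour",
    "rollup": true
  }
}
```

With this configuration, rows whose dimension values match and whose timestamps fall within the same hour are combined into a single stored row.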

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid have matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.

Review comment:
       ```suggestion
   ## Perfect rollup vs best-effort rollup
   
   Depending on the ingestion method, Druid has the following rollup options:
   - Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
   - _Best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
   ```
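
As a point of reference, native batch ingestion requests guaranteed perfect rollup through the `tuningConfig`; a sketch along these lines (the shard count here is arbitrary):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "hashed",
      "numShards": 4
    }
  }
}
```

Guaranteed rollup requires a non-dynamic `partitionsSpec`, such as hash-based or range-based partitioning.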

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.

Review comment:
       ```suggestion
   This topic describes how to set up partitions within a single datasource. It does not cover how to use multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain even while you continuously add new data from a stream.
+
+The following table shows how each ingestion method handles partitioning:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
+|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
+|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
+|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how partitions the datasource.. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|

Review comment:
       "defines how partitions the datasource" sounds unclear

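For readers wondering what the `partitionsSpec` referenced in the table looks like in practice, a hypothetical range-partitioning sketch for native batch ingestion (the dimension name and row target are invented):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "countryName",
      "targetRowsPerSegment": 5000000
    }
  }
}
```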
##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -1,4 +1,4 @@
----
+  ---

Review comment:
       ```suggestion
   ---
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.

Review comment:
       phrasing seems a bit strange; another option could be "partitioning of and sorting segments" though not much better

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.

Review comment:
       ```suggestion
   Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade-off for the efficiency of rollup, you lose the ability to query individual events.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad-hoc manner.
+
+If you disable [rollup](./rollup.md), then Druid treats the set of
+dimensions like a set of columns to ingest. The dimensions behave exactly as you would expect from any database that does not support a rollup feature.
+
+At ingestion time, you configure dimensions in the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
+
+## Metrics
+
+Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you Specify a metric, you can apply an aggregation function to each row during ingestion. This

Review comment:
       ```suggestion
   Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you specify a metric, you can apply an aggregation function to each row during ingestion. This
   ```
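
A short, hypothetical `metricsSpec` showing the kind of aggregation functions described above (the field names are invented):

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "packets_sum", "fieldName": "packets" },
    { "type": "longSum", "name": "bytes_sum", "fieldName": "bytes" }
  ]
}
```

Each ingested row contributes to these aggregates; with rollup enabled, Druid stores only the aggregated values per combination of truncated timestamp and dimension values.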

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain even while you continuously add new data from a stream.

Review comment:
       "older than a certain **event**" or "older than a certain **X**, even while" ?

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.

Review comment:
       ```suggestion
   > Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad-hoc manner.

Review comment:
       ```suggestion
   Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad hoc manner.
   ```
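
A minimal `dimensionsSpec` sketch for reference, mirroring the spec example elsewhere in this PR; string dimensions can be listed as bare names, while typed dimensions use an object form:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "language",
      { "type": "long", "name": "userId" }
    ]
  }
}
```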

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid have matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this for one of the following reasons:
+- The ingestion method parallelizes ingestion without a shuffling step required for perfect rollup.
+- The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received,

Review comment:
       ```suggestion
   - The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received.
   ```






[GitHub] [druid] techdocsmith commented on pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-896498405


   This PR got out of sync beyond repair. #11576 carries it. cc: @suneet-s , @sthetland 




[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683727426



##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+You can control other important operations that are based on the primary timestamp in the
+[`granularitySpec`](./ingestion-spec.md#granularityspec). If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad hoc manner.
+
+If you disable [rollup](./rollup.md), then Druid treats the set of
+dimensions like a set of columns to ingest. The dimensions behave exactly as you would expect from any database that does not support a rollup feature.
+
+At ingestion time, you configure dimensions in the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
+
+## Metrics
+
+Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you specify a metric, you can apply an aggregation function to each row during ingestion. This
+has the following benefits:
+
+- With [rollup](rollup.md) enabled, Druid collapses multiple rows into one row even while retaining summary information. For example, the [rollup tutorial](../tutorials/tutorial-rollup.md) demonstrates using rollup to collapse netflow data to a single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.

Review comment:
       "even" here does not seem appropriate. Maybe this: "Rollup is a form of aggregation that collapses dimensions while aggregating the values in the metrics, that is, it collapses rows but retains its summary information."






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683722747



##########
File path: docs/ingestion/data-formats.md
##########
@@ -128,6 +128,8 @@ The CSV `inputFormat` has the following components:
 
 ### TSV (Delimited)
 
+The `inputFormat` to load data of a delimited format. An example is:

Review comment:
       An example for `inputFormat` using JSON data is: 

   Is it better to put the example after the table is introduced?
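
For context, a hedged sketch of what the delimited (TSV) `inputFormat` typically looks like; the column names are placeholders, and the exact option set should be checked against the reference table in the doc:

```json
{
  "type": "tsv",
  "delimiter": "\t",
  "columns": ["timestamp", "page", "language", "bytes_added"],
  "findColumnsFromHeader": false,
  "skipHeaderRows": 0
}
```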






[GitHub] [druid] techdocsmith edited a comment on pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith edited a comment on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-892057136


   @sthetland & @vtlim , @loquisgon  PTAL




[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686460570



##########
File path: docs/development/extensions-core/protobuf.md
##########
@@ -112,82 +112,86 @@ Important supervisor properties
 - `protoBytesDecoder.descriptor` for the descriptor file URL
 - `protoBytesDecoder.protoMessageType` from the proto definition
 - `protoBytesDecoder.type` set to `file`, indicate use descriptor file to decode Protobuf file
-- `parser` should have `type` set to `protobuf`, but note that the `format` of the `parseSpec` must be `json`
+- `inputFormat` should have `type` set to `protobuf`
 
 ```json
 {
-  "type": "kafka",
-  "dataSchema": {
-    "dataSource": "metrics-protobuf",
-    "parser": {
-      "type": "protobuf",
-      "protoBytesDecoder": {
-        "type": "file",
-        "descriptor": "file:///tmp/metrics.desc",
-        "protoMessageType": "Metrics"
-      },
-      "parseSpec": {
-        "format": "json",
+"type": "kafka",
+"spec": {
+    "dataSchema": {
+        "dataSource": "metrics-protobuf",
         "timestampSpec": {
-          "column": "timestamp",
-          "format": "auto"
+            "column": "timestamp",
+            "format": "auto"
         },
         "dimensionsSpec": {
-          "dimensions": [
-            "unit",
-            "http_method",
-            "http_code",
-            "page",
-            "metricType",
-            "server"
-          ],
-          "dimensionExclusions": [
-            "timestamp",
-            "value"
-          ]
+            "dimensions": [
+                "unit",
+                "http_method",
+                "http_code",
+                "page",
+                "metricType",
+                "server"
+            ],
+            "dimensionExclusions": [
+                "timestamp",
+                "value"
+            ]
+        },
+        "metricsSpec": [
+            {
+                "name": "count",
+                "type": "count"
+            },
+            {
+                "name": "value_sum",
+                "fieldName": "value",
+                "type": "doubleSum"
+            },
+            {
+                "name": "value_min",
+                "fieldName": "value",
+                "type": "doubleMin"
+            },
+            {
+                "name": "value_max",
+                "fieldName": "value",
+                "type": "doubleMax"
+            }
+        ],
+        "granularitySpec": {
+            "type": "uniform",
+            "segmentGranularity": "HOUR",
+            "queryGranularity": "NONE"
         }
-      }
     },
-    "metricsSpec": [
-      {
-        "name": "count",
-        "type": "count"
-      },
-      {
-        "name": "value_sum",
-        "fieldName": "value",
-        "type": "doubleSum"
-      },
-      {
-        "name": "value_min",
-        "fieldName": "value",
-        "type": "doubleMin"
-      },
-      {
-        "name": "value_max",
-        "fieldName": "value",
-        "type": "doubleMax"
-      }
-    ],
-    "granularitySpec": {
-      "type": "uniform",
-      "segmentGranularity": "HOUR",
-      "queryGranularity": "NONE"
-    }
-  },
-  "tuningConfig": {
-    "type": "kafka",
-    "maxRowsPerSegment": 5000000
-  },
-  "ioConfig": {
-    "topic": "metrics_pb",
-    "consumerProperties": {
-      "bootstrap.servers": "localhost:9092"
+    "tuningConfig": {
+        "type": "kafka",
+        "maxRowsPerSegment": 5000000
     },
-    "taskCount": 1,
-    "replicas": 1,
-    "taskDuration": "PT1H"
-  }
+    "ioConfig": {
+        "topic": "metrics_pb",
+        "consumerProperties": {
+            "bootstrap.servers": "localhost:9092"
+        },
+        "inputFormat": {
+            "type": "protobuf",
+            "protoBytesDecoder": {
+                "type": "file",
+                "descriptor": "file:///tmp/metrics.desc",
+                "protoMessageType": "Metrics"
+            },
+            "flattenSpec": {
+                "useFieldDiscovery": true
+            },
+            "binaryAsString": false
+        },
+        "taskCount": 1,
+        "replicas": 1,
+        "taskDuration": "PT1H",
+        "type": "kafka"
+    }
+}

Review comment:
       This passes a JSON parser, so I'm not going to change it b/c it doesn't relate to this PR.






[GitHub] [druid] sthetland commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686386422



##########
File path: docs/ingestion/index.md
##########
@@ -59,6 +50,8 @@ The most recommended, and most popular, method of streaming ingestion is the
 [Kafka indexing service](../development/extensions-core/kafka-ingestion.md) that reads directly from Kafka. Alternatively, the Kinesis
 indexing service works with Amazon Kinesis Data Streams.
 
+Streaming ingestion uses an onging process called a supervisor that reads from the data stream to ingest data into Druid.

Review comment:
       ```suggestion
   Streaming ingestion uses an ongoing process called a supervisor, which reads from the data stream to ingest data into Druid.
   ```
   Or "Streaming ingestion uses an ongoing process called a supervisor, which ingests data into Druid by reading from data streams."
   
   

##########
File path: docs/ingestion/data-formats.md
##########
@@ -215,21 +218,22 @@ The `inputFormat` to load data of Parquet format. An example is:
 }
 ```
 
-The Parquet `inputFormat` has the following components:
+### Avro Stream
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
+To use the Avro Stream input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
 
-### Avro Stream
+For more information on how Druid handles Avro types, see the [Avro Types](../development/extensions-core/avro.md#avro-types) section.
 
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream input format.
+Configure the Avro `inputFormat` to load Avro data as follows:
 
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |

Review comment:
       ```suggestion
   |flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from an Avro record. Note that only 'path' expressions are supported; 'jq' is unavailable.| no (default will auto-discover 'root' level properties) |
   ```
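
To ground the table above, a hedged `avro_stream` example; the inline-schema decoder is only one of the available decoder options, and the Avro schema itself is invented:

```json
{
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "type": "record",
      "name": "Event",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "page", "type": "string" }
      ]
    }
  },
  "flattenSpec": {
    "useFieldDiscovery": true
  },
  "binaryAsString": false
}
```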

##########
File path: docs/querying/query-context.md
##########
@@ -61,6 +61,7 @@ Unless otherwise noted, the following parameters apply to all query types.
 |useFilterCNF|`false`| If true, Druid will attempt to convert the query filter to Conjunctive Normal Form (CNF). During query processing, columns can be pre-filtered by intersecting the bitmap indexes of all values that match the eligible filters, often greatly reducing the raw number of rows which need to be scanned. But this effect only happens for the top level filter, or individual clauses of a top level 'and' filter. As such, filters in CNF potentially have a higher chance to utilize a large amount of bitmap indexes on string columns during pre-filtering. However, this setting should be used with great caution, as it can sometimes have a negative effect on performance, and in some cases, the act of computing CNF of a filter can be expensive. We recommend hand tuning your filters to produce an optimal form if possible, or at least verifying through experimentation that using this parameter actually improves your query performance with no ill-effects.|
 |secondaryPartitionPruning|`true`|Enable secondary partition pruning on the Broker. The Broker will always prune unnecessary segments from the input scan based on a filter on time intervals, but if the data is further partitioned with hash or range partitioning, this option will enable additional pruning based on a filter on secondary partition dimensions.|
 |enableJoinLeftTableScanDirect|`false`|This flag applies to queries which have joins. For joins, where left child is a simple scan with a filter,  by default, druid will run the scan as a query and the join the results to the right child on broker. Setting this flag to true overrides that behavior and druid will attempt to push the join to data servers instead. Please note that the flag could be applicable to queries even if there is no explicit join. since queries can internally translated into a join by the SQL planner.|
+|debug| `false` | Flag indicating whether to enable debugging outputs for the query. When set to false, no additional logs will be produced (logs produced will be entirely dependent on your logging level). When set to true, the following addition logs will be produced:<br />- Log the stack trace of the exception (if any) produced by the query |

Review comment:
       ```suggestion
   |debug| `false` | Flag indicating whether to enable debugging outputs for the query. When set to false, no additional logs will be produced. (Logs produced will be entirely dependent on your logging level.) When set to true, the following additional logs will be produced:<br />- Log the stack trace of the exception (if any) produced by the query |
   ```
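
A small sketch of how the new flag would be passed in practice, attached to a hypothetical native query:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2013-08-31/2013-09-01"],
  "granularity": "all",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": {
    "debug": true
  }
}
```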

##########
File path: docs/ingestion/ingestion-spec.md
##########
@@ -0,0 +1,464 @@
+---
+id: ingestion-spec
+title: Ingestion spec reference
+sidebar_label: Ingestion spec
+description: Reference for the configuration options in the ingestion spec.
+---
+
+All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). All types of ingestion use an _ingestion spec_ to configure ingestion.
+
+Ingestion specs consist of three main components:
+
+- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
+   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
+- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
+   documentation for each [ingestion method](./index.md#ingestion-methods).
+- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
+  [ingestion method](./index.md#ingestion-methods).
+
+Example ingestion spec for task type `index_parallel` (native batch):
+
+```
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia",
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "auto"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          { "type": "string", "page" },
+          { "type": "string", "language" },
+          { "type": "long", "name": "userId" }
+        ]
+      },
+      "metricsSpec": [
+        { "type": "count", "name": "count" },
+        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+      ],
+      "granularitySpec": {
+        "segmentGranularity": "day",
+        "queryGranularity": "none",
+        "intervals": [
+          "2013-08-31/2013-09-01"
+        ]
+      }
+    },
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "local",
+        "baseDir": "examples/indexing/",
+        "filter": "wikipedia_data.json"
+      },
+      "inputFormat": {
+        "type": "json",
+        "flattenSpec": {
+          "useFieldDiscovery": true,
+          "fields": [
+            { "type": "path", "name": "userId", "expr": "$.user.id" }
+          ]
+        }
+      }
+    },
+    "tuningConfig": {
+      "type": "index_parallel"
+    }
+  }
+}
+```
+
+The specific options supported by these sections will depend on the [ingestion method](./index.md#ingestion-methods) you have chosen.
+For more examples, refer to the documentation for each ingestion method.
+
+You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
+available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
+[Kafka](../development/extensions-core/kafka-ingestion.md),
+[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
+[native batch](native-batch.md) mode.
+
+## `dataSchema`
+
+> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
+except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
+
+The `dataSchema` is a holder for the following components:
+
+- [datasource name](#datasource)
+- [primary timestamp](#timestampspec)
+- [dimensions](#dimensionsspec)
+- [metrics](#metricsspec)
+- [transforms and filters](#transformspec) (if needed).
+
+An example `dataSchema` is:
+
+```
+"dataSchema": {
+  "dataSource": "wikipedia",
+  "timestampSpec": {
+    "column": "timestamp",
+    "format": "auto"
+  },
+  "dimensionsSpec": {
+    "dimensions": [
+      { "type": "string", "page" },
+      { "type": "string", "language" },
+      { "type": "long", "name": "userId" }
+    ]
+  },
+  "metricsSpec": [
+    { "type": "count", "name": "count" },
+    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+  ],
+  "granularitySpec": {
+    "segmentGranularity": "day",
+    "queryGranularity": "none",
+    "intervals": [
+      "2013-08-31/2013-09-01"
+    ]
+  }
+}
+```
+
+### `dataSource`
+
+The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
+[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
+`dataSource` is:
+
+```
+"dataSource": "my-first-datasource"
+```
+
+### `timestampSpec`
+
+The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
+configuring the [primary timestamp](./data-model.md#primary-timestamp). An example `timestampSpec` is:
+
+```
+"timestampSpec": {
+  "column": "timestamp",
+  "format": "auto"
+}
+```
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+A `timestampSpec` can have the following components:
+
+|Field|Description|Default|
+|-----|-----------|-------|
+|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
+|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
+|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
+
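+For example, a `timestampSpec` that parses a custom Joda-style format and falls back to a fixed timestamp when the column is missing might look like the following (the column name and values are illustrative):
+
+```
+"timestampSpec": {
+  "column": "visit_time",
+  "format": "yyyy-MM-dd HH:mm:ss",
+  "missingValue": "2010-01-01T00:00:00.000Z"
+}
+```
+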
+### `dimensionsSpec`
+
+The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
+configuring [dimensions](./data-model.md#dimensions). An example `dimensionsSpec` is:
+
+```
+"dimensionsSpec" : {
+  "dimensions": [
+    "page",
+    "language",
+    { "type": "long", "name": "userId" }
+  ],
+  "dimensionExclusions" : [],
+  "spatialDimensions" : []
+}
+```
+
+> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
+> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
+> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
+> your ingestion spec.
+
+A `dimensionsSpec` can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
+| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
+| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
+
+#### Dimension objects
+
+Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
+a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
+
+Dimension objects can have the following components:
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| type | Either `string`, `long`, `float`, or `double`. | `string` |
+| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
+| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
+| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
+
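+For example, a dimension object that keeps a multi-value field in its original order and skips the bitmap index could be declared as follows (the field name is illustrative):
+
+```
+{
+  "type": "string",
+  "name": "tags",
+  "createBitmapIndex": false,
+  "multiValueHandling": "array"
+}
+```
+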
+#### Inclusions and exclusions
+
+Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
+
+Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
+
+Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
+
+1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
+2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.

Review comment:
       ```suggestion
   2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the `flattenSpec`. The `useFieldDiscovery` parameter determines whether the original root-level fields will be retained or discarded.
   ```

##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -22,9 +22,21 @@ title: "Multi-value dimensions"
   ~ under the License.
   -->
 
+<<<<<<< HEAD

Review comment:
       uh oh, this looks familiar. I think lines 27 through 39 can just be replaced with 
   
   ```
   Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
   array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter` characters). 
   
   By default Druid stores the values in alphabetical order, see [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for configuration.
   ```
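   For what it's worth, an illustrative (made-up) input record that would produce a multi-value string dimension named `tags`:

   ```
   {"timestamp": "2021-01-01T00:00:00Z", "tags": ["a", "b", "c"]}
   ```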

##########
File path: docs/development/extensions-core/protobuf.md
##########
@@ -112,82 +112,86 @@ Important supervisor properties
 - `protoBytesDecoder.descriptor` for the descriptor file URL
 - `protoBytesDecoder.protoMessageType` from the proto definition
 - `protoBytesDecoder.type` set to `file`, indicate use descriptor file to decode Protobuf file
-- `parser` should have `type` set to `protobuf`, but note that the `format` of the `parseSpec` must be `json`
+- `inputFormat` should have `type` set to `protobuf`
 
 ```json
 {
-  "type": "kafka",
-  "dataSchema": {
-    "dataSource": "metrics-protobuf",
-    "parser": {
-      "type": "protobuf",
-      "protoBytesDecoder": {
-        "type": "file",
-        "descriptor": "file:///tmp/metrics.desc",
-        "protoMessageType": "Metrics"
-      },
-      "parseSpec": {
-        "format": "json",
+"type": "kafka",
+"spec": {
+    "dataSchema": {
+        "dataSource": "metrics-protobuf",
         "timestampSpec": {
-          "column": "timestamp",
-          "format": "auto"
+            "column": "timestamp",
+            "format": "auto"
         },
         "dimensionsSpec": {
-          "dimensions": [
-            "unit",
-            "http_method",
-            "http_code",
-            "page",
-            "metricType",
-            "server"
-          ],
-          "dimensionExclusions": [
-            "timestamp",
-            "value"
-          ]
+            "dimensions": [
+                "unit",
+                "http_method",
+                "http_code",
+                "page",
+                "metricType",
+                "server"
+            ],
+            "dimensionExclusions": [
+                "timestamp",
+                "value"
+            ]
+        },
+        "metricsSpec": [
+            {
+                "name": "count",
+                "type": "count"
+            },
+            {
+                "name": "value_sum",
+                "fieldName": "value",
+                "type": "doubleSum"
+            },
+            {
+                "name": "value_min",
+                "fieldName": "value",
+                "type": "doubleMin"
+            },
+            {
+                "name": "value_max",
+                "fieldName": "value",
+                "type": "doubleMax"
+            }
+        ],
+        "granularitySpec": {
+            "type": "uniform",
+            "segmentGranularity": "HOUR",
+            "queryGranularity": "NONE"
         }
-      }
     },
-    "metricsSpec": [
-      {
-        "name": "count",
-        "type": "count"
-      },
-      {
-        "name": "value_sum",
-        "fieldName": "value",
-        "type": "doubleSum"
-      },
-      {
-        "name": "value_min",
-        "fieldName": "value",
-        "type": "doubleMin"
-      },
-      {
-        "name": "value_max",
-        "fieldName": "value",
-        "type": "doubleMax"
-      }
-    ],
-    "granularitySpec": {
-      "type": "uniform",
-      "segmentGranularity": "HOUR",
-      "queryGranularity": "NONE"
-    }
-  },
-  "tuningConfig": {
-    "type": "kafka",
-    "maxRowsPerSegment": 5000000
-  },
-  "ioConfig": {
-    "topic": "metrics_pb",
-    "consumerProperties": {
-      "bootstrap.servers": "localhost:9092"
+    "tuningConfig": {
+        "type": "kafka",
+        "maxRowsPerSegment": 5000000
     },
-    "taskCount": 1,
-    "replicas": 1,
-    "taskDuration": "PT1H"
-  }
+    "ioConfig": {
+        "topic": "metrics_pb",
+        "consumerProperties": {
+            "bootstrap.servers": "localhost:9092"
+        },
+        "inputFormat": {
+            "type": "protobuf",
+            "protoBytesDecoder": {
+                "type": "file",
+                "descriptor": "file:///tmp/metrics.desc",
+                "protoMessageType": "Metrics"
+            },
+            "flattenSpec": {
+                "useFieldDiscovery": true
+            },
+            "binaryAsString": false
+        },
+        "taskCount": 1,
+        "replicas": 1,
+        "taskDuration": "PT1H",
+        "type": "kafka"
+    }
+}

Review comment:
        An extra `}` may have gotten added here. 

##########
File path: docs/ingestion/data-formats.md
##########
@@ -422,11 +427,20 @@ Multiple Instances:
 
 ### Avro OCF
 
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro OCF input format.
+To use the Avro OCF input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
 
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
+See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid.
 
-The `inputFormat` to load data of Avro OCF format. An example is:
+Configure the Avro OCF `inputFormat` to load Avro OCF data as follows:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |

Review comment:
       ```suggestion
   |flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from Avro records. Note that only 'path' expressions are supported; 'jq' is unavailable.| no (default will auto-discover 'root' level properties) |
   ```
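   For context, a minimal sketch of an Avro OCF `inputFormat` with a `flattenSpec` (the `$.user.id` path is borrowed from the ingestion spec example elsewhere in this PR):

   ```
   "inputFormat": {
     "type": "avro_ocf",
     "flattenSpec": {
       "useFieldDiscovery": true,
       "fields": [
         { "type": "path", "name": "userId", "expr": "$.user.id" }
       ]
     }
   }
   ```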

##########
File path: docs/ingestion/data-formats.md
##########
@@ -498,18 +513,18 @@ The `inputFormat` to load data of Protobuf format. An example is:
 }
 ```
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Protobuf record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-|`protoBytesDecoder`| JSON Object |Specifies how to decode bytes to Protobuf record. | yes |
-
 ### FlattenSpec
 
-The `flattenSpec` is located in `inputFormat` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-An example `flattenSpec` is:
+The `flattenSpec` bridges the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. It is a object within the `inputFormat` object.

Review comment:
       ```suggestion
   The `flattenSpec` bridges the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. It is an object within the `inputFormat` object.
   ```
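   For reference, a hedged sketch of what such a `flattenSpec` object can look like on its own (the field name and path expression are made up):

   ```
   "flattenSpec": {
     "useFieldDiscovery": true,
     "fields": [
       { "type": "path", "name": "fullName", "expr": "$.user.name" }
     ]
   }
   ```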

##########
File path: docs/ingestion/native-batch.md
##########
@@ -772,10 +773,10 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
 |numShards|Deprecated. Use `partitionsSpec` instead. Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data. `numShards` cannot be specified if `maxRowsPerSegment` is set.|null|no|
 |partitionDimensions|Deprecated. Use `partitionsSpec` instead. The dimensions to partition on. Leave blank to select all dimensions. Only used with `forceGuaranteedRollup` = true, will be ignored otherwise.|null|no|
 |partitionsSpec|Defines how to partition data in each timeChunk, see [PartitionsSpec](#partitionsspec)|`dynamic` if `forceGuaranteedRollup` = false, `hashed` if `forceGuaranteedRollup` = true|no|
-|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](index.md#indexspec)|null|no|
-|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](index.md#indexspec) for possible values.|same as indexSpec|no|
+|indexSpec|Defines segment storage format options to be used at indexing time, see [IndexSpec](ingestion-spec.md#indexspec)|null|no|
+|indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. this can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. however, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see [IndexSpec](ingestion-spec.md#indexspec) for possible values.|same as indexSpec|no|

Review comment:
       ```suggestion
   |indexSpecForIntermediatePersists|Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published. See [`IndexSpec`](ingestion-spec.md#indexspec) for possible values.|same as `indexSpec`|no|
   ```
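   Not part of the suggestion, but as a rough illustration of what an `indexSpec` block can contain (the values shown are common choices, not a recommendation):

   ```
   "indexSpec": {
     "bitmap": { "type": "roaring" },
     "dimensionCompression": "lz4",
     "metricCompression": "lz4",
     "longEncoding": "longs"
   }
   ```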





[GitHub] [druid] techdocsmith commented on pull request #11541: Ingestion

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-892057136


   @sthetland & @vtlim , PTAL



[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683724068



##########
File path: docs/ingestion/data-formats.md
##########
@@ -128,6 +128,8 @@ The CSV `inputFormat` has the following components:
 
 ### TSV (Delimited)
 
+The `inputFormat` to load data of a delimited format. An example is:

Review comment:
    Also this phrase "The `inputFormat` to load data of a delimited format. An example is:" keeps being repeated after each type is introduced... it feels repetitive and unnecessary.
   





[GitHub] [druid] loquisgon commented on pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-893726585


   All of my comments are content related, so I leave it to your judgment whether to address them. I am fine with the separation of sections, so LGTM.



[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683789237



##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consists of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec has the following two additional components beyond the ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property applies to all
-[ingestion methods](#ingestion-methods) except for Hadoop ingestion. Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until the next persist will decrease, and the task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from the maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using an offset or a lookup table depending on column cardinality, and stores them with variable sizes. `longs` stores the value as-is with 8 bytes each.|`longs`|
-
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |

Review comment:
       Normally I'd prefer the English "Partitions spec" to the code, but the header is `partitionsSpec` so went with that.
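For anyone reading along, a minimal hash-based `partitionsSpec` of the kind this row links to is sketched below for the native batch `index_parallel` task. Field names follow the native batch reference linked in the row; the `country` dimension and the shard count are hypothetical, and hash- or range-based partitioning in native batch also assumes `forceGuaranteedRollup` is `true`, matching the Rollup modes row above.

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "partitionDimensions": ["country"],
    "numShards": 4
  }
}
```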






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683699109



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN, although MiddleManager or Indexer processes still start and monitor the Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.

Review comment:
       We already said in the first paragraph that Druid creates segments. Now we are saying that they get stored in a special place. I would rephrase this as 
   
   "Segments created by the ingestion process get stored in [deep storage...] which in turn are loaded in Historical nodes by Historical processes in order to respond to Historical queries" 
   
   At some point the distinction has to be made between queries served by historical processes and those served by real time (i.e. middle manager/indexer) processes. BTW the latter only happens for streaming ingestion.






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682069638



##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to store, reducing row counts by potentially orders of magnitude. The trade-off for this efficiency is that you lose the ability to query individual events.

Review comment:
       Much better I think.
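To anchor the paragraph being discussed, rollup is switched on or off through the `granularitySpec` covered elsewhere in this PR. A minimal sketch, with illustrative granularity values, looks like:

```json
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true
}
```

With a spec like this, rows whose dimension values are identical and whose timestamps truncate to the same hour collapse into a single stored row.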






[GitHub] [druid] loquisgon edited a comment on pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon edited a comment on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-893726585


   All of my comments are content related, so I defer to your judgment on whether to address them or not. I am fine with the separation of sections, so LGTM after you look at my and others' comments.




[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686470902



##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -22,9 +22,21 @@ title: "Multi-value dimensions"
   ~ under the License.
   -->
 
+<<<<<<< HEAD

Review comment:
       OK I missed this






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682075580



##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, if you are doing initial ingestion through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consist of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not include fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Rows will be rolled up if they have exactly the same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec has the following two additional components beyond the ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property applies to all
-[ingestion methods](#ingestion-methods) except for Hadoop ingestion. Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until the next persist will decrease, and the task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from the maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using an offset or a lookup table depending on column cardinality, and stores them with variable sizes. `longs` stores the value as-is with 8 bytes each.|`longs`|
-
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details.|
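
For readers skimming this hunk, the `indexSpec` properties tabled above nest inside the `tuningConfig`; a minimal sketch for a native batch (`index_parallel`) task, with illustrative values rather than recommendations from this PR:

```json
"tuningConfig": {
  "type": "index_parallel",
  "indexSpec": {
    "bitmap": { "type": "roaring", "compressRunOnSerialization": true },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "auto"
  }
}
```

Any property left out falls back to the defaults listed in the table above.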

Review comment:
       Looks good in `npm run start`






[GitHub] [druid] techdocsmith closed pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith closed pull request #11541:
URL: https://github.com/apache/druid/pull/11541


   




[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683704425



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic and the following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.
 
 ## Ingestion methods
 
-The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose

Review comment:
       Below in the table, we introduce "Supervisor type" with no explanation of what a supervisor is... can we link that term to wherever we explain it?






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683786173



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic and the following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.
 
 ## Ingestion methods
 
-The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose
+The tables below list Druid's most common data ingestion methods, along with comparisons to help you choose

Review comment:
       There is nowhere in the docs that describes a supervisor 😞 . I added my own brief definition.
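
As an aside for readers of this thread: a supervisor is the long-running controller that the Overlord runs for streaming ingestion; it creates and manages the indexing tasks that read from the stream. A heavily trimmed sketch of a Kafka supervisor spec, with hypothetical datasource and topic names, just to show the shape:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "example-stream",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "language"] },
      "granularitySpec": { "type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
    },
    "ioConfig": {
      "type": "kafka",
      "topic": "example-topic",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

The spec is submitted to the Overlord's supervisor API, which then schedules and supervises the ingestion tasks over time.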






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686459977



##########
File path: docs/development/extensions-core/protobuf.md
##########
@@ -112,82 +112,86 @@ Important supervisor properties
 - `protoBytesDecoder.descriptor` for the descriptor file URL
 - `protoBytesDecoder.protoMessageType` from the proto definition
 - `protoBytesDecoder.type` set to `file`, indicate use descriptor file to decode Protobuf file
-- `parser` should have `type` set to `protobuf`, but note that the `format` of the `parseSpec` must be `json`
+- `inputFormat` should have `type` set to `protobuf`
 
 ```json
 {
-  "type": "kafka",
-  "dataSchema": {
-    "dataSource": "metrics-protobuf",
-    "parser": {
-      "type": "protobuf",
-      "protoBytesDecoder": {
-        "type": "file",
-        "descriptor": "file:///tmp/metrics.desc",
-        "protoMessageType": "Metrics"
-      },
-      "parseSpec": {
-        "format": "json",
+"type": "kafka",
+"spec": {
+    "dataSchema": {
+        "dataSource": "metrics-protobuf",
         "timestampSpec": {
-          "column": "timestamp",
-          "format": "auto"
+            "column": "timestamp",
+            "format": "auto"
         },
         "dimensionsSpec": {
-          "dimensions": [
-            "unit",
-            "http_method",
-            "http_code",
-            "page",
-            "metricType",
-            "server"
-          ],
-          "dimensionExclusions": [
-            "timestamp",
-            "value"
-          ]
+            "dimensions": [
+                "unit",
+                "http_method",
+                "http_code",
+                "page",
+                "metricType",
+                "server"
+            ],
+            "dimensionExclusions": [
+                "timestamp",
+                "value"
+            ]
+        },
+        "metricsSpec": [
+            {
+                "name": "count",
+                "type": "count"
+            },
+            {
+                "name": "value_sum",
+                "fieldName": "value",
+                "type": "doubleSum"
+            },
+            {
+                "name": "value_min",
+                "fieldName": "value",
+                "type": "doubleMin"
+            },
+            {
+                "name": "value_max",
+                "fieldName": "value",
+                "type": "doubleMax"
+            }
+        ],
+        "granularitySpec": {
+            "type": "uniform",
+            "segmentGranularity": "HOUR",
+            "queryGranularity": "NONE"
         }
-      }
     },
-    "metricsSpec": [
-      {
-        "name": "count",
-        "type": "count"
-      },
-      {
-        "name": "value_sum",
-        "fieldName": "value",
-        "type": "doubleSum"
-      },
-      {
-        "name": "value_min",
-        "fieldName": "value",
-        "type": "doubleMin"
-      },
-      {
-        "name": "value_max",
-        "fieldName": "value",
-        "type": "doubleMax"
-      }
-    ],
-    "granularitySpec": {
-      "type": "uniform",
-      "segmentGranularity": "HOUR",
-      "queryGranularity": "NONE"
-    }
-  },
-  "tuningConfig": {
-    "type": "kafka",
-    "maxRowsPerSegment": 5000000
-  },
-  "ioConfig": {
-    "topic": "metrics_pb",
-    "consumerProperties": {
-      "bootstrap.servers": "localhost:9092"
+    "tuningConfig": {
+        "type": "kafka",
+        "maxRowsPerSegment": 5000000
     },
-    "taskCount": 1,
-    "replicas": 1,
-    "taskDuration": "PT1H"
-  }
+    "ioConfig": {
+        "topic": "metrics_pb",
+        "consumerProperties": {
+            "bootstrap.servers": "localhost:9092"
+        },
+        "inputFormat": {
+            "type": "protobuf",
+            "protoBytesDecoder": {
+                "type": "file",
+                "descriptor": "file:///tmp/metrics.desc",
+                "protoMessageType": "Metrics"
+            },
+            "flattenSpec": {
+                "useFieldDiscovery": true
+            },
+            "binaryAsString": false
+        },
+        "taskCount": 1,
+        "replicas": 1,
+        "taskDuration": "PT1H",
+        "type": "kafka"
+    }
+}

Review comment:
       I didn't touch these, so I have no idea how this happened. Not sure why these last 3 even wound up in this PR.






[GitHub] [druid] vtlim commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
vtlim commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682163600



##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.

Review comment:
       ```suggestion
   Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
   ```
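
For context, the `timestampSpec` referenced in this paragraph is a small block inside the `dataSchema`; an illustrative example (the column name and missing-value timestamp are hypothetical):

```json
"timestampSpec": {
  "column": "event_time",
  "format": "iso",
  "missingValue": "2010-01-01T00:00:00Z"
}
```

Whatever the source column is called, the parsed value ends up in the `__time` column.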






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682739247



##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.

Review comment:
       ```suggestion
   You can use segment partitioning and sorting within your Druid datasources to reduce the size of your data and increase performance.
   ```
   Yah that was from the original doc :/ 
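
For anyone reading along, the secondary partitioning discussed here is configured through `partitionsSpec` in the native batch `tuningConfig`; a hedged sketch using a hypothetical partitioning dimension:

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "country",
    "targetRowsPerSegment": 5000000
  }
}
```

Range (`single_dim`) and hashed partitioning require `forceGuaranteedRollup` to be true, consistent with the rollup-mode table quoted earlier in this thread.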






[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682127049



##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.

Review comment:
       No. `granularitySpec` controls other aspects not controlled by the `timestampSpec`. I think it will be better to move the last sentence starting with "Regardless" to line 15, after "ingestion time." Then start a new paragraph for `granularitySpec`. I'll make this change after you push yours.
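
For anyone following this thread, the `granularitySpec` in question sits alongside the `timestampSpec` in the `dataSchema` and controls segment granularity, query granularity, and rollup; an illustrative block (the interval is the one used in the PR's own example spec):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "hour",
  "rollup": true,
  "intervals": ["2013-08-31/2013-09-01"]
}
```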






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683701635



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic and the following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.

Review comment:
       Insert a paragraph: 
   
   "This topic also describe the two main methods for  ingesting data in Druid: streaming and batch"...blah blah blah
   
   
   






[GitHub] [druid] sthetland commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682009378



##########
File path: docs/ingestion/ingestion-spec.md
##########
@@ -0,0 +1,462 @@
+---
+id: ingestion-spec
+title: Ingestion spec reference
+sidebar_label: Ingestion spec
+description: Reference for the configuration options in the ingestion spec.
+---
+
+All ingestion methods use ingestion tasks to load data in to druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [tasks](tasks.md). For all types of ingestion, the _ingestion spec_ 
+
+Ingestion specs consists of three main components:
+
+- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
+   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
+- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
+   documentation for each [ingestion method](./index.md#ingestion-methods).
+- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
+  [ingestion method](./index.md#ingestion-methods).
+
+Example ingestion spec for task type `index_parallel` (native batch):
+
+```
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia",
+      "timestampSpec": {
+        "column": "timestamp",
+        "format": "auto"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          { "type": "string", "page" },
+          { "type": "string", "language" },
+          { "type": "long", "name": "userId" }
+        ]
+      },
+      "metricsSpec": [
+        { "type": "count", "name": "count" },
+        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
+        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
+      ],
+      "granularitySpec": {
+        "segmentGranularity": "day",
+        "queryGranularity": "none",
+        "intervals": [
+          "2013-08-31/2013-09-01"
+        ]
+      }
+    },
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "local",
+        "baseDir": "examples/indexing/",
+        "filter": "wikipedia_data.json"
+      },
+      "inputFormat": {
+        "type": "json",
+        "flattenSpec": {
+          "useFieldDiscovery": true,
+          "fields": [
+            { "type": "path", "name": "userId", "expr": "$.user.id" }
+          ]
+        }
+      }
+    },
+    "tuningConfig": {
+      "type": "index_parallel"
+    }
+  }
+}
+```
+
+The specific options supported by these sections will depend on the [ingestion method](./index.md#ingestion-methods) you have chosen.
+For more examples, refer to the documentation for each ingestion method.
+
+You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
+available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
+[Kafka](../development/extensions-core/kafka-ingestion.md),
+[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
+[native batch](native-batch.md) mode.
+
+## `dataSchema`
+
+> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
+except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
+
+The `dataSchema` is a holder for the following components:
+
+- [datasource name](#datasource), [primary timestamp](#timestampspec),

Review comment:
       Were these all meant to be bulleted? Currently, it's only one bullet. 

##########
File path: docs/ingestion/data-formats.md
##########
@@ -532,8 +532,8 @@ A `flattenSpec` can have the following components:
 
 | Field | Description | Default |
 |-------|-------------|---------|
-| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./index.md#timestampspec), [`transformSpec`](./index.md#transformspec), [`dimensionsSpec`](./index.md#dimensionsspec), and [`metricsSpec`](./index.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
-| fields | Specifies the fields of interest and how they are accessed. [See below for details.](#field-flattening-specifications) | `[]` |
+| useFieldDiscovery | If true, interpret all root-level fields as available fields for usage by [`timestampSpec`](./ingestion-spec.md#timestampspec), [`transformSpec`](./ingestion-spec.md#transformspec), [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec), and [`metricsSpec`](./ingestion-spec.md#metricsspec).<br><br>If false, only explicitly specified fields (see `fields`) will be available for use. | `true` |
+| fields | Specifies the fields of interest and how they are accessed. [See Field flattening specifications for more detail.](#field-flattening-specifications) | `[]` |

Review comment:
       ```suggestion
   | fields | Specifies the fields of interest and how they are accessed. See [Field flattening specifications](#field-flattening-specifications) for more detail. | `[]` |
   ```
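
For context, the two fields in this table combine into a `flattenSpec` nested inside the `inputFormat`; a small example consistent with the spec quoted elsewhere in this PR (the JSONPath expression and field name are illustrative):

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "userId", "expr": "$.user.id" }
  ]
}
```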

##########
File path: docs/ingestion/ingestion-spec.md
##########
@@ -0,0 +1,462 @@
+---
+id: ingestion-spec
+title: Ingestion spec reference
+sidebar_label: Ingestion spec
+description: Reference for the configuration options in the ingestion spec.
+---
+
+All ingestion methods use ingestion tasks to load data in to druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [tasks](tasks.md). For all types of ingestion, the _ingestion spec_ 

Review comment:
       Looks like the last sentence got cut off. 
   
   ```suggestion
   All ingestion methods use ingestion tasks to load data into Druid. Streaming ingestion uses ongoing supervisors that run and supervise a set of tasks over time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md). For all types of ingestion, the _ingestion spec_ 
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introducdes rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.

Review comment:
       ```suggestion
   description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
   ```

##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consists of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
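
As an illustrative sketch (the `tags` column is hypothetical), a `dimensions` list can mix plain names with fully specified objects that use the fields above:

```json
"dimensions": [
  "page",
  { "type": "long", "name": "userId" },
  {
    "type": "string",
    "name": "tags",
    "createBitmapIndex": false,
    "multiValueHandling": "sorted_set"
  }
]
```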
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
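
For example, a minimal schemaless `dimensionsSpec` leaves `dimensions` and `spatialDimensions` empty and lists only the fields to drop; the excluded field name below is hypothetical:

```json
"dimensionsSpec": {
  "dimensions": [],
  "dimensionExclusions": ["debug_flag"]
}
```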
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
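
As one possible sketch of that last case, assuming the `hyperUnique` aggregator suits your accuracy needs, a `metricsSpec` can pre-compute a complex column for approximate distinct counts alongside the usual row count:

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "hyperUnique", "name": "userId_hll", "fieldName": "userId" }
]
```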
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to, or finer, than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of `minute` will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` will be applied if it is set even when the `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Your data will be rolled up if they have the exactly same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
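
To make the shadowing and timestamp-replacement behavior described above concrete, here is a sketch (field names are illustrative) that rewrites `country` in place and floors the primary timestamp to the hour:

```json
"transforms": [
  { "type": "expression", "name": "country", "expression": "upper(country)" },
  { "type": "expression", "name": "__time", "expression": "timestamp_floor(__time, 'PT1H')" }
]
```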
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
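
For example, because transforms run before the filter, a filter can match on a transformed value. The sketch below (values are illustrative) keeps only rows whose upper-cased country equals a given string:

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "countryUpper",
    "value": "SAN SERRIFFE"
  }
}
```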
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec has below two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above.
-
-- [input row parser](#parser-deprecated), [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in the `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated and it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source system. The `inputFormat` property applies to all
-[ingestion method](#ingestion-methods) except for Hadoop ingestion. The Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
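
As a slightly fuller sketch for native batch ingestion (paths are illustrative), the method-specific portion typically supplies an `inputSource` alongside the `inputFormat`:

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "local",
    "baseDir": "examples/indexing/",
    "filter": "wikipedia_data.json"
  },
  "inputFormat": {
    "type": "json"
  }
}
```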
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from maxBytesInMemory check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
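
For example, an `indexSpec` that switches the bitmap type while leaving the other values at the defaults listed above would look like this (a sketch, not a tuning recommendation):

```json
"indexSpec": {
  "bitmap": {
    "type": "roaring",
    "compressRunOnSerialization": true
  },
  "dimensionCompression": "lz4",
  "metricCompression": "lz4",
  "longEncoding": "longs"
}
```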
-
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details.|

Review comment:
       This row is missing the two columns on the right. I think this would fix it, but pls double-check me and test the table after committing:
    
   ```suggestion
   | **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introducdes rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data your can dramatically reduce the size of data to be store and reducing row counts by potentially orders of magnitude. The trade off for the efficiency of rollup means you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimensions](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec)).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid having matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this for one of the following reasons:
+- The ingestion method parallelizes ingestion without a shuffling step required for perfect rollup.
+- The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received,
+In both of these cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion run in this mode.
+
+Ingestion methods that guarantee perfect rollup use an additional preprocessing step to determine intervals and partitioning before data ingestion. This preprocessing step scans the entire input dataset. While this step it increases the time required for ingestion, it provides information necessary for perfect rollup.

Review comment:
       ```suggestion
   Ingestion methods that guarantee perfect rollup use an additional preprocessing step to determine intervals and partitioning before data ingestion. This preprocessing step scans the entire input dataset. While this step increases the time required for ingestion, it provides information necessary for perfect rollup.
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introducdes rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data your can dramatically reduce the size of data to be store and reducing row counts by potentially orders of magnitude. The trade off for the efficiency of rollup means you lose the ability to query individual events.

Review comment:
       Not sure if this suggestion improves the last sentence, but "means" doesn't seem like quite the right word.  Might take it in another direction, such as, "With rollup, you trade efficiency of storage and retrieval against the ability to query individual events." or something like that.  
   
   ```suggestion
   Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always be stores the timestamp in the `__time` column in your Druid datasource.

Review comment:
       ```suggestion
   [`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introducdes rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data your can dramatically reduce the size of data to be store and reducing row counts by potentially orders of magnitude. The trade off for the efficiency of rollup means you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimensions](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec)).

Review comment:
       ```suggestion
   At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
   ```






[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683709100



##########
File path: docs/ingestion/index.md
##########
@@ -88,656 +79,5 @@ This table compares the three available options:
 | **External dependencies** | None. | Hadoop cluster (Druid submits Map/Reduce jobs). | None. |
 | **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
 | **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
-| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
-
-<a name="data-model"></a>
-
-## Druid's data model
-
-### Datasources
-
-Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid
-offers a unique data modeling system that bears similarity to both relational and timeseries models.
-
-### Primary timestamp
-
-Druid schemas must always include a primary timestamp. The primary timestamp is used for
-[partitioning and sorting](#partitioning) your data. Druid queries are able to rapidly identify and retrieve data
-corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column
-for time-based [data management operations](data-management.md) such as dropping time chunks, overwriting time chunks,
-and time-based retention rules.
-
-The primary timestamp is parsed based on the [`timestampSpec`](#timestampspec). In addition, the
-[`granularitySpec`](#granularityspec) controls other important operations that are based on the primary timestamp.
-Regardless of which input field the primary timestamp is read from, it will always be stored as a column named `__time`
-in your Druid datasource.
-
-If you have more than one timestamp column, you can store the others as
-[secondary timestamps](schema-design.md#secondary-timestamps).
-
-### Dimensions
-
-Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply
-aggregators to dimensions at query time in an ad-hoc manner. If you run with [rollup](#rollup) disabled, then the set of
-dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical
-database that does not support a rollup feature.
-
-Dimensions are configured through the [`dimensionsSpec`](#dimensionsspec).
-
-### Metrics
-
-Metrics are columns that are stored in an aggregated form. They are most useful when [rollup](#rollup) is enabled.
-Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This
-has two benefits:
-
-1. If [rollup](#rollup) is enabled, multiple rows can be collapsed into one row even while retaining summary
-information. In the [rollup tutorial](../tutorials/tutorial-rollup.md), this is used to collapse netflow data to a
-single row per `(minute, srcIP, dstIP)` tuple, while retaining aggregate information about total packet and byte counts.
-2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if
-they are partially computed at ingestion time.
-
-Metrics are configured through the [`metricsSpec`](#metricsspec).
-
-## Rollup
-
-### What is rollup?
-
-Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is
-a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that
-needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost:
-as we roll up data, we lose the ability to query individual events.
-
-When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar
-to what you would expect from a typical database that does not support a rollup feature.
-
-When rollup is enabled, then any rows that have identical [dimensions](#dimensions) and [timestamp](#primary-timestamp)
-to each other (after [`queryGranularity`-based truncation](#granularityspec)) can be collapsed, or _rolled up_, into a
-single row in Druid.
-
-By default, rollup is enabled.
-
-### Enabling or disabling rollup
-
-Rollup is controlled by the `rollup` setting in the [`granularitySpec`](#granularityspec). By default, it is `true`
-(enabled). Set this to `false` if you want Druid to store each record as-is, without any rollup summarization.
-
-### Example of rollup
-
-For an example of how to configure rollup, and of how the feature will modify your data, check out the
-[rollup tutorial](../tutorials/tutorial-rollup.md).
-
-### Maximizing rollup ratio
-
-You can measure the rollup ratio of a datasource by comparing the number of rows in Druid (`COUNT`) with the number of ingested
-events.  One way to do this is with a
-[Druid SQL](../querying/sql.md) query such as the following, where "count" refers to a `count`-type metric generated at ingestion time:
-
-```sql
-SELECT SUM("count") / (COUNT(*) * 1.0)
-FROM datasource
-```
-
-The higher this number is, the more benefit you are gaining from rollup.
-
-> See [Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about
-how counting works when rollup is enabled.
-
-Tips for maximizing rollup:
-
-- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
-you will achieve.
-- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
-- Adjusting `queryGranularity` at ingestion time (for example, using `PT5M` instead of `PT1M`) increases the
-likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
-- It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full"
-datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that
-has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using
-that datasource leads to much faster query times. This can often be done with just a small increase in storage
-footprint, since abbreviated datasources tend to be substantially smaller.
-- If you are using a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect
-rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by
-[reindexing](data-management.md#reingesting-data) or [compacting](compaction.md) your data in the background after initial ingestion.
-
-### Perfect rollup vs Best-effort rollup
-
-Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
-time. Others offer _best-effort rollup_, meaning that input data might not be perfectly aggregated and thus there could
-be multiple segments holding rows with the same timestamp and dimension values.
-
-In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
-without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
-segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
-cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion
-run in this mode.
-
-Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals
-and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which
-generally increases the time required for ingestion, but provides information necessary for perfect rollup.
-
-The following table shows how each method handles rollup:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|`index_parallel` and `index` type may be either perfect or best-effort, based on configuration.|
-|[Hadoop](hadoop.md)|Always perfect.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Always best-effort.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Always best-effort.|
-
-## Partitioning
-
-### Why partition?
-
-Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and
-performance.
-
-Druid datasources are always partitioned by time into _time chunks_, and each time chunk contains one or more segments.
-This partitioning happens for all ingestion methods, and is based on the `segmentGranularity` parameter of your
-ingestion spec's `dataSchema`.
-
-The segments within a particular time chunk may also be partitioned further, using options that vary based on the
-ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will
-improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed
-quickly.
-
-You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
-dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold
-storage size decreases - and it also tends to improve query performance as well.
-
-> Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider
-> placing it first in the `dimensions` list of your `dimensionsSpec`, which tells Druid to sort rows within each segment
-> by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.
->
-> However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first
-> dimension listed in your `dimensionsSpec`. This can prevent dimension sorting from being maximally effective. If
-> necessary, you can work around this limitation by setting `queryGranularity` equal to `segmentGranularity` in your
-> [`granularitySpec`](#granularityspec), which will set all timestamps within the segment to the same value, and by saving
-> your "real" timestamp as a [secondary timestamp](schema-design.md#secondary-timestamps). This limitation may be removed
-> in a future version of Druid.
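
A minimal sketch of that workaround, assuming daily segments, sets the two granularities equal so every row within a segment shares the same `__time` value:

```json
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "day",
  "rollup": true
}
```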
-
-### How to set up partitioning
-
-Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of
-flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like
-Kafka) then you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after it
-is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain
-threshold is optimally partitioned, even as you continuously add new data from a stream.
-
-The following table shows how each ingestion method handles partitioning:
-
-|Method|How it works|
-|------|------------|
-|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
-|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
-|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
-|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
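
As an illustration of the native batch row above (a sketch; the dimension name and target size are hypothetical), secondary partitioning on a frequently filtered dimension is configured inside the `tuningConfig`:

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "page",
    "targetRowsPerSegment": 5000000
  }
}
```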
-
-> Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable
-> approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If
-> you go with this approach, then you can ignore this section, since it is describing how to set up partitioning
-> _within a single datasource_.
->
-> For more details on splitting data up into separate datasources, and potential operational considerations, refer
-> to the [Multitenancy considerations](../querying/multitenancy.md) page.
-
-<a name="spec"></a>
-
-## Ingestion specs
-
-No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.md) or
-ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor
-definition is an _ingestion spec_.
-
-Ingestion specs consists of three main components:
-
-- [`dataSchema`](#dataschema), which configures the [datasource name](#datasource),
-   [primary timestamp](#timestampspec), [dimensions](#dimensionsspec), [metrics](#metricsspec), and [transforms and filters](#transformspec) (if needed).
-- [`ioConfig`](#ioconfig), which tells Druid how to connect to the source system and how to parse data. For more information, see the
-   documentation for each [ingestion method](#ingestion-methods).
-- [`tuningConfig`](#tuningconfig), which controls various tuning parameters specific to each
-  [ingestion method](#ingestion-methods).
-
-Example ingestion spec for task type `index_parallel` (native batch):
-
-```
-{
-  "type": "index_parallel",
-  "spec": {
-    "dataSchema": {
-      "dataSource": "wikipedia",
-      "timestampSpec": {
-        "column": "timestamp",
-        "format": "auto"
-      },
-      "dimensionsSpec": {
-        "dimensions": [
-          "page",
-          "language",
-          { "type": "long", "name": "userId" }
-        ]
-      },
-      "metricsSpec": [
-        { "type": "count", "name": "count" },
-        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-      ],
-      "granularitySpec": {
-        "segmentGranularity": "day",
-        "queryGranularity": "none",
-        "intervals": [
-          "2013-08-31/2013-09-01"
-        ]
-      }
-    },
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "local",
-        "baseDir": "examples/indexing/",
-        "filter": "wikipedia_data.json"
-      },
-      "inputFormat": {
-        "type": "json",
-        "flattenSpec": {
-          "useFieldDiscovery": true,
-          "fields": [
-            { "type": "path", "name": "userId", "expr": "$.user.id" }
-          ]
-        }
-      }
-    },
-    "tuningConfig": {
-      "type": "index_parallel"
-    }
-  }
-}
-```
-
-The specific options supported by these sections will depend on the [ingestion method](#ingestion-methods) you have chosen.
-For more examples, refer to the documentation for each ingestion method.
-
-You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality
-available in Druid's [web console](../operations/druid-console.md). Druid's visual data loader supports
-[Kafka](../development/extensions-core/kafka-ingestion.md),
-[Kinesis](../development/extensions-core/kinesis-ingestion.md), and
-[native batch](native-batch.md) mode.
-
-## `dataSchema`
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See the [Legacy `dataSchema` spec](#legacy-dataschema-spec) for the old spec.
-
-The `dataSchema` is a holder for the following components:
-
-- [datasource name](#datasource), [primary timestamp](#timestampspec),
-  [dimensions](#dimensionsspec), [metrics](#metricsspec), and 
-  [transforms and filters](#transformspec) (if needed).
-
-An example `dataSchema` is:
-
-```
-"dataSchema": {
-  "dataSource": "wikipedia",
-  "timestampSpec": {
-    "column": "timestamp",
-    "format": "auto"
-  },
-  "dimensionsSpec": {
-    "dimensions": [
-      "page",
-      "language",
-      { "type": "long", "name": "userId" }
-    ]
-  },
-  "metricsSpec": [
-    { "type": "count", "name": "count" },
-    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-  ],
-  "granularitySpec": {
-    "segmentGranularity": "day",
-    "queryGranularity": "none",
-    "intervals": [
-      "2013-08-31/2013-09-01"
-    ]
-  }
-}
-```
-
-### `dataSource`
-
-The `dataSource` is located in `dataSchema` → `dataSource` and is simply the name of the
-[datasource](../design/architecture.md#datasources-and-segments) that data will be written to. An example
-`dataSource` is:
-
-```
-"dataSource": "my-first-datasource"
-```
-
-### `timestampSpec`
-
-The `timestampSpec` is located in `dataSchema` → `timestampSpec` and is responsible for
-configuring the [primary timestamp](#primary-timestamp). An example `timestampSpec` is:
-
-```
-"timestampSpec": {
-  "column": "timestamp",
-  "format": "auto"
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `timestampSpec` can have the following components:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|column|Input row field to read the primary timestamp from.<br><br>Regardless of the name of this input field, the primary timestamp will always be stored as a column named `__time` in your Druid datasource.|timestamp|
-|format|Timestamp format. Options are: <ul><li>`iso`: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"</li><li>`posix`: seconds since epoch</li><li>`millis`: milliseconds since epoch</li><li>`micro`: microseconds since epoch</li><li>`nano`: nanoseconds since epoch</li><li>`auto`: automatically detects ISO (either 'T' or space separator) or millis format</li><li>any [Joda DateTimeFormat string](http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html)</li></ul>|auto|
-|missingValue|Timestamp to use for input records that have a null or missing timestamp `column`. Should be in ISO8601 format, like `"2000-01-01T01:02:03.456"`, even if you have specified something else for `format`. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. |none|
-
-### `dimensionsSpec`
-
-The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
-configuring [dimensions](#dimensions). An example `dimensionsSpec` is:
-
-```
-"dimensionsSpec" : {
-  "dimensions": [
-    "page",
-    "language",
-    { "type": "long", "name": "userId" }
-  ],
-  "dimensionExclusions" : [],
-  "spatialDimensions" : []
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-A `dimensionsSpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
-| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
-
-#### Dimension objects
-
-Each dimension in the `dimensions` list can either be a name or an object. Providing a name is equivalent to providing
-a `string` type dimension object with the given name, e.g. `"page"` is equivalent to `{"name": "page", "type": "string"}`.
-
-Dimension objects can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, or `double`. | `string` |
-| name | The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.<br><br>Note that you can use a [`transformSpec`](#transformspec) if you want to rename columns during ingestion time. | none (required) |
-| createBitmapIndex | For `string` typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for `string` typed dimensions. | `true` |
-| multiValueHandling | Specify the type of handling for [multi-value fields](../querying/multi-value-dimensions.md). Possible values are `sorted_array`, `sorted_set`, and `array`. `sorted_array` and `sorted_set` order the array upon ingestion. `sorted_set` removes duplicates. `array` ingests data as-is | `sorted_array` |
-
-#### Inclusions and exclusions
-
-Druid will interpret a `dimensionsSpec` in two possible ways: _normal_ or _schemaless_.
-
-Normal interpretation occurs when either `dimensions` or `spatialDimensions` is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of `dimensionExclusions` will be ignored.
-
-Schemaless interpretation occurs when both `dimensions` and `spatialDimensions` are empty or null. In this case, the set of dimensions is determined in the following way:
-
-1. First, start from the set of all root-level fields from the input record, as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes all fields at the top level of a data structure, but does not included fields nested within maps or lists. To extract these, you must use a [`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data formats, such as CSV and delimited text, are considered root-level.
-2. If a [`flattenSpec`](./data-formats.md#flattenspec) is being used, the set of root-level fields includes any fields generated by the flattenSpec. The useFieldDiscovery parameter determines whether the original root-level fields will be retained or discarded.
-3. Any field listed in `dimensionExclusions` is excluded.
-4. The field listed as `column` in the [`timestampSpec`](#timestampspec) is excluded.
-5. Any field used as an input to an aggregator from the [metricsSpec](#metricsspec) is excluded.
-6. Any field with the same name as an aggregator from the [metricsSpec](#metricsspec) is excluded.
-7. All other fields are ingested as `string` typed dimensions with the [default settings](#dimension-objects).
-
-> Note: Fields generated by a [`transformSpec`](#transformspec) are not currently considered candidates for
-> schemaless dimension interpretation.
-
-### `metricsSpec`
-
-The `metricsSpec` is located in `dataSchema` → `metricsSpec` and is a list of [aggregators](../querying/aggregations.md)
-to apply at ingestion time. This is most useful when [rollup](#rollup) is enabled, since it's how you configure
-ingestion-time aggregation.
-
-An example `metricsSpec` is:
-
-```
-"metricsSpec": [
-  { "type": "count", "name": "count" },
-  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
-  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
-]
-```
-
-> Generally, when [rollup](#rollup) is disabled, you should have an empty `metricsSpec` (because without rollup,
-> Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However,
-> in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of
-> pre-computing part of an [approximate aggregation](../querying/aggregations.md#approximate-aggregations), this can only
-> be done by defining a metric in a `metricsSpec`.
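
For example, assuming the `druid-datasketches` extension is loaded, the following sketch of the idea pre-computes a Theta sketch at ingestion time from a hypothetical `user_id` field, which can later feed approximate distinct counts:

```json
"metricsSpec": [
  { "type": "thetaSketch", "name": "user_id_sketch", "fieldName": "user_id" }
]
```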
-
-### `granularitySpec`
-
-The `granularitySpec` is located in `dataSchema` → `granularitySpec` and is responsible for configuring
-the following operations:
-
-1. Partitioning a datasource into [time chunks](../design/architecture.md#datasources-and-segments) (via `segmentGranularity`).
-2. Truncating the timestamp, if desired (via `queryGranularity`).
-3. Specifying which time chunks of segments should be created, for batch ingestion (via `intervals`).
-4. Specifying whether ingestion-time [rollup](#rollup) should be used or not (via `rollup`).
-
-Other than `rollup`, these operations are all based on the [primary timestamp](#primary-timestamp).
-
-An example `granularitySpec` is:
-
-```
-"granularitySpec": {
-  "segmentGranularity": "day",
-  "queryGranularity": "none",
-  "intervals": [
-    "2013-08-31/2013-09-01"
-  ],
-  "rollup": true
-}
-```
-
-A `granularitySpec` can have the following components:
-
-| Field | Description | Default |
-|-------|-------------|---------|
-| type | Either `uniform` or `arbitrary`. In most cases you want to use `uniform`.| `uniform` |
-| segmentGranularity | [Time chunking](../design/architecture.md#datasources-and-segments) granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to `day`, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any [granularity](../querying/granularities.md) can be provided here. Note that all segments in the same time chunk should have the same segment granularity.<br><br>Ignored if `type` is set to `arbitrary`.| `day` |
-| queryGranularity | The resolution of timestamp storage within each segment. This must be equal to or finer than `segmentGranularity`. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. For example, a value of `minute` means that records are stored at minute granularity and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc.).<br><br>Any [granularity](../querying/granularities.md) can be provided here. Use `none` to store timestamps as-is, without any truncation. Note that `rollup` is still applied even when `queryGranularity` is set to `none`. | `none` |
-| rollup | Whether to use ingestion-time [rollup](#rollup) or not. Note that rollup is still effective even when `queryGranularity` is set to `none`. Data is rolled up when rows have exactly the same timestamp. | `true` |
-| intervals | A list of intervals describing what time chunks of segments should be created. If `type` is set to `uniform`, this list will be broken up and rounded-off based on the `segmentGranularity`. If `type` is set to `arbitrary`, this list will be used as-is.<br><br>If `null` or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.<br><br>If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.<br><br>Ignored for any form of streaming ingestion. | `null` |
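
To make the interaction of these fields concrete, the following example (interval and granularities chosen arbitrarily) creates hourly time chunks, truncates stored timestamps to the minute, and limits batch ingestion to a single day:

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "hour",
  "queryGranularity": "minute",
  "intervals": ["2013-08-31/2013-09-01"],
  "rollup": true
}
```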
-
-### `transformSpec`
-
-The `transformSpec` is located in `dataSchema` → `transformSpec` and is responsible for transforming and filtering
-records during ingestion time. It is optional. An example `transformSpec` is:
-
-```
-"transformSpec": {
-  "transforms": [
-    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
-  ],
-  "filter": {
-    "type": "selector",
-    "dimension": "country",
-    "value": "San Serriffe"
-  }
-}
-```
-
-> Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order:
-> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
-> and finally [`dimensionsSpec`](#dimensionsspec) and [`metricsSpec`](#metricsspec). Keep this in mind when writing
-> your ingestion spec.
-
-#### Transforms
-
-The `transforms` list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a
-"name" which can be referred to by your `dimensionsSpec`, `metricsSpec`, etc.
-
-If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that
-shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
-
-Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular,
-they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field
-with another field containing all nulls, which will act similarly to removing the field.
-
-Transforms can refer to the [timestamp](#timestampspec) of an input row by referring to `__time` as part of the expression.
-They can also _replace_ the timestamp if you set their "name" to `__time`. In both cases, `__time` should be treated as
-a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
-`timestampSpec`.
-
-Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:
-
-```
-{
-  "type": "expression",
-  "name": "<output name>",
-  "expression": "<expr>"
-}
-```
-
-The `expression` is a [Druid query expression](../misc/math-expr.md).
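
For example, to shift the primary timestamp forward by one hour during ingestion, you could shadow `__time` itself (a sketch only; the offset is arbitrary and expressed in milliseconds):

```json
{
  "type": "expression",
  "name": "__time",
  "expression": "__time + 3600000"
}
```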
-
-#### Filter
-
-The `filter` conditionally filters input rows during ingestion. Only rows that pass the filter will be
-ingested. Any of Druid's standard [query filters](../querying/filters.md) can be used. Note that within a
-`transformSpec`, the `transforms` are applied before the `filter`, so the filter can refer to a transform.
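
For example, building on the earlier `transformSpec`, the filter can reference the `countryUpper` transform by name (a sketch only):

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "countryUpper",
    "value": "SAN SERRIFFE"
  }
}
```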
-
-### Legacy `dataSchema` spec
-
-> The `dataSchema` spec has been changed in 0.17.0. The new spec is supported by all ingestion methods
-except for _Hadoop_ ingestion. See [`dataSchema`](#dataschema) for the new spec.
-
-The legacy `dataSchema` spec includes two more components in addition to the ones listed in the [`dataSchema`](#dataschema) section above:
-
-- [input row parser](#parser-deprecated)
-- [flattening of nested data](#flattenspec) (if needed)
-
-#### `parser` (Deprecated)
-
-In legacy `dataSchema`, the `parser` is located in `dataSchema` → `parser` and is responsible for configuring a wide variety of
-items related to parsing input records. The `parser` is deprecated; it is highly recommended to use `inputFormat` instead.
-For details about `inputFormat` and supported `parser` types, see the ["Data formats" page](data-formats.md).
-
-For details about major components of the `parseSpec`, refer to their subsections:
-
-- [`timestampSpec`](#timestampspec), responsible for configuring the [primary timestamp](#primary-timestamp).
-- [`dimensionsSpec`](#dimensionsspec), responsible for configuring [dimensions](#dimensions).
-- [`flattenSpec`](#flattenspec), responsible for flattening nested data formats.
-
-An example `parser` is:
-
-```
-"parser": {
-  "type": "string",
-  "parseSpec": {
-    "format": "json",
-    "flattenSpec": {
-      "useFieldDiscovery": true,
-      "fields": [
-        { "type": "path", "name": "userId", "expr": "$.user.id" }
-      ]
-    },
-    "timestampSpec": {
-      "column": "timestamp",
-      "format": "auto"
-    },
-    "dimensionsSpec": {
-      "dimensions": [
-        "page",
-        "language",
-        { "type": "long", "name": "userId" }
-      ]
-    }
-  }
-}
-```
-
-#### `flattenSpec`
-
-In the legacy `dataSchema`, the `flattenSpec` is located in `dataSchema` → `parser` → `parseSpec` → `flattenSpec` and is responsible for
-bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model.
-See [Flatten spec](./data-formats.md#flattenspec) for more details.
-
-## `ioConfig`
-
-The `ioConfig` influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted
-filesystem, or any other supported source. The `inputFormat` property applies to all
-[ingestion methods](#ingestion-methods) except for Hadoop ingestion. Hadoop ingestion still
-uses the [`parser`](#parser-deprecated) in the legacy `dataSchema`.
-The rest of `ioConfig` is specific to each individual ingestion method.
-An example `ioConfig` to read JSON data is:
-
-```json
-"ioConfig": {
-    "type": "<ingestion-method-specific type code>",
-    "inputFormat": {
-      "type": "json"
-    },
-    ...
-}
-```
-For more details, see the documentation provided by each [ingestion method](#ingestion-methods).
-
-## `tuningConfig`
-
-Tuning properties are specified in a `tuningConfig`, which goes at the top level of an ingestion spec. Some
-properties apply to all [ingestion methods](#ingestion-methods), but most are specific to each individual
-ingestion method. An example `tuningConfig` that sets all of the shared, common properties to their defaults
-is:
-
-```plaintext
-"tuningConfig": {
-  "type": "<ingestion-method-specific type code>",
-  "maxRowsInMemory": 1000000,
-  "maxBytesInMemory": <one-sixth of JVM memory>,
-  "indexSpec": {
-    "bitmap": { "type": "roaring" },
-    "dimensionCompression": "lz4",
-    "metricCompression": "lz4",
-    "longEncoding": "longs"
-  },
-  <other ingestion-method-specific properties>
-}
-```
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are `index`, `hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of `maxBytesInMemory` available until the next persist decreases, and the task fails when the sum of bytes of all intermediary persisted artifacts exceeds `maxBytesInMemory`.<br /><br />Setting `maxBytesInMemory` to -1 disables this check, meaning Druid relies entirely on `maxRowsInMemory` to control memory usage. Setting it to zero means the default value is used (one-sixth of the JVM heap size).<br /><br />Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set `maxBytesInMemory` to -1 and rely on `maxRowsInMemory` instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of `maxBytesInMemory` takes into account overhead objects created during ingestion and each intermediate persist. Setting this to `true` excludes the bytes of these overhead objects from the `maxBytesInMemory` check.|false|
-|indexSpec|Tune how data is indexed. See below for more information.|See table below|
-|Other properties|Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: [Kafka indexing service](../development/extensions-core/kafka-ingestion.md#tuningconfig), [Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md#tuningconfig), [Native batch](native-batch.md#tuningconfig), and [Hadoop-based](hadoop.md#tuningconfig).||
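
For example, as described above for `maxBytesInMemory`, you can disable the byte-based check and control memory purely by row count (a sketch only; the row count is arbitrary and `index` is just one of the possible type codes):

```json
"tuningConfig": {
  "type": "index",
  "maxRowsInMemory": 500000,
  "maxBytesInMemory": -1
}
```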
-
-#### `indexSpec`
-
-The `indexSpec` object can include the following properties:
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "concise"}`|
-|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, or `uncompressed`.|`lz4`|
-|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
-|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using an offset or lookup table depending on column cardinality, and stores them with a variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
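
As an illustration, an `indexSpec` that overrides some of these defaults might look like the following (these choices are examples, not recommendations):

```json
"indexSpec": {
  "bitmap": { "type": "concise" },
  "dimensionCompression": "lz4",
  "metricCompression": "lz4",
  "longEncoding": "auto"
}
```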
-
-Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each
-[ingestion method](#ingestion-methods) for details.
+| **[Rollup modes](rollup.md)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).  | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details.| Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |

Review comment:
       Use consistent naming for anchors: partitionSpec vs "Partition Spec"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682128191



##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to load your data into separate datasources. This is a perfectly viable approach that works well when the number of datasources does not lead to excessive per-datasource overheads.
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can limit the benefit of sorting by dimension. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid then sets all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
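
As a rough sketch of both ideas (the dimension names here are hypothetical), you might list a natural partitioning dimension first and align the granularities:

```json
"dimensionsSpec": {
  "dimensions": ["productId", "userId", "channel"]
},
"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "day"
}
```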
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka, you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain threshold, even while you continuously add new data from a stream.
+
+The following table shows how each ingestion method handles partitioning:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
+|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
+|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
+|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how partitions the datasource.. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|

Review comment:
       ```suggestion
   |[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
   |[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how Druid partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
   ```
   
   Missing `Druid`. @vtlim 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683808120



##########
File path: docs/ingestion/data-formats.md
##########
@@ -521,8 +523,8 @@ An example `flattenSpec` is:
 }
 ```
 > Conceptually, after input data records are read, the `flattenSpec` is applied first before

Review comment:
       This is the section on flattenSpec and it goes in the inputFormat object. I couldn't find a better place to link.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686469331



##########
File path: docs/ingestion/data-formats.md
##########
@@ -422,11 +427,20 @@ Multiple Instances:
 
 ### Avro OCF
 
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro OCF input format.
+To use the Avro OCF input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
 
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
+See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid.
 
-The `inputFormat` to load data of Avro OCF format. An example is:
+Configure the Avro OCF `inputFormat` to load Avro OCF data as follows:
+
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `avro_ocf` to read Avro OCF file| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro records. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |

Review comment:
       I didn't touch this one, can we fix in another PR




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683722747



##########
File path: docs/ingestion/data-formats.md
##########
@@ -128,6 +128,8 @@ The CSV `inputFormat` has the following components:
 
 ### TSV (Delimited)
 
+The `inputFormat` to load data of a delimited format. An example is:

Review comment:
       An example for `inputForm` using JSON data is:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683701635



##########
File path: docs/ingestion/index.md
##########
@@ -22,33 +22,24 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN, although MiddleManager or Indexer processes still start and monitor the Hadoop jobs.
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
 
-## How to use this documentation
+This topic and the following topics describe ingestion concepts and information that apply to all [ingestion methods](#ingestion-methods):
+- [Druid data model](./data-model.md) introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+- [Data rollup](./rollup.md) describes rollup as a concept and provides suggestions to maximize the benefits of rollup.
+- [Partitioning](./partitioning.md) describes time chunk and secondary partitioning in Druid.
+- [Ingestion spec reference](./ingestion-spec.md) provides a reference for the configuration options in the ingestion spec.
 
-This **page you are currently reading** provides information about universal Druid ingestion concepts, and about
-configurations that are common to all [ingestion methods](#ingestion-methods).
-
-The **individual pages for each ingestion method** provide additional information about concepts and configurations
-that are unique to each ingestion method.
-
-We recommend reading (or at least skimming) this universal page first, and then referring to the page for the
-ingestion method or methods that you have chosen.
+For additional information about concepts and configurations that are unique to each ingestion method, see the topic for the ingestion method.

Review comment:
       Insert a paragraph: 
   
   "This topic also describe the two main ways of ingest data in Druid: streaming and batch"...blah blah blah
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on pull request #11541:
URL: https://github.com/apache/druid/pull/11541#issuecomment-896476278


   @sthetland , @suneet-s , I think I finally got all the merge conflicts, spelling, and link checking handled. What a challenge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683724404



##########
File path: docs/ingestion/data-formats.md
##########
@@ -521,8 +523,8 @@ An example `flattenSpec` is:
 }
 ```
 > Conceptually, after input data records are read, the `flattenSpec` is applied first before

Review comment:
       I did not see a section that explains "flattenSpec"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r686469265



##########
File path: docs/ingestion/data-formats.md
##########
@@ -215,21 +218,22 @@ The `inputFormat` to load data of Parquet format. An example is:
 }
 ```
 
-The Parquet `inputFormat` has the following components:
+### Avro Stream
 
-| Field | Type | Description | Required |
-|-------|------|-------------|----------|
-|type| String| This should be set to `parquet` to read Parquet file| yes |
-|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Parquet file. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |
-| binaryAsString | Boolean | Specifies if the bytes parquet column which is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
+To use the Avro Stream input format, load the Druid Avro extension ([`druid-avro-extensions`](../development/extensions-core/avro.md)).
 
-### Avro Stream
+For more information on how Druid handles Avro types, see the [Avro Types](../development/extensions-core/avro.md#avro-types) section.
 
-> You need to include the [`druid-avro-extensions`](../development/extensions-core/avro.md) as an extension to use the Avro Stream input format.
+Configure the Avro `inputFormat` to load Avro data as follows:
 
-> See the [Avro Types](../development/extensions-core/avro.md#avro-types) section for how Avro types are handled in Druid
+| Field | Type | Description | Required |
+|-------|------|-------------|----------|
+|type| String| This should be set to `avro_stream` to read Avro serialized data| yes |
+|flattenSpec| JSON Object |Define a [`flattenSpec`](#flattenspec) to extract nested values from a Avro record. Note that only 'path' expression are supported ('jq' is unavailable).| no (default will auto-discover 'root' level properties) |

Review comment:
       I didn't touch this. Can we fix it in another PR>




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682739247



##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.

Review comment:
       ```suggestion
   You can use segment partitioning and sorting within your Druid datasources to reduce the size of your data and increase performance.
   ```
   Yah that was from the original doc :/ 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] loquisgon commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
loquisgon commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r683731587



##########
File path: docs/ingestion/data-formats.md
##########
@@ -521,8 +523,8 @@ An example `flattenSpec` is:
 }
 ```
 > Conceptually, after input data records are read, the `flattenSpec` is applied first before

Review comment:
       I found it after a while: [`flattenSpec`](../../ingestion/ingestion-spec.md#flattenspec) maybe insert an anchor to it?





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] techdocsmith commented on a change in pull request #11541: Docs Ingestion page refactor

Posted by GitBox <gi...@apache.org>.
techdocsmith commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682135293



##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data and to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. Regardless of the source field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
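
For illustration, a minimal `timestampSpec` reading a source column named `timestamp` with automatic format detection might look like this (the column name is an assumption for the example):

```json
"timestampSpec": {
  "column": "timestamp",
  "format": "auto"
}
```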
+
+You can control other important operations that are based on the primary timestamp

Review comment:
       ```suggestion
   You can control other important operations that are based on the primary timestamp in the
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org