Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/03 21:30:14 UTC

[GitHub] [druid] vtlim commented on a change in pull request #11541: Docs Ingestion page refactor

vtlim commented on a change in pull request #11541:
URL: https://github.com/apache/druid/pull/11541#discussion_r682074888



##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes start and monitor Hadoop jobs. 
 
-Once segments have been generated and stored in [deep storage](../dependencies/deep-storage.md), they are loaded by Historical processes. 
-For more details on how this works, see the [Storage design](../design/architecture.md#storage-design) section 
-of Druid's design documentation.
+After Druid creates segments have been generated and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.

Review comment:
       ```suggestion
   After Druid creates segments and stores them in [deep storage](../dependencies/deep-storage.md), Historical processes load them to respond to queries. See the [Storage design](../design/architecture.md#storage-design) section of the Druid design documentation for more information.
   ```

##########
File path: docs/ingestion/index.md
##########
@@ -22,29 +22,20 @@ title: "Ingestion"
   ~ under the License.
   -->
 
-All data in Druid is organized into _segments_, which are data files each of which may have up to a few million rows.
-Loading data in Druid is called _ingestion_ or _indexing_, and consists of reading data from a source system and creating
-segments based on that data.
+Loading data in Druid is called _ingestion_ or _indexing_. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called _segments_. In general, segment files contain a few million rows.
 
-In most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes
-(or the [Indexer](../design/indexer.md) processes) load your source data. One exception is
-Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer
-processes are still involved in starting and monitoring the Hadoop jobs). 
+For most ingestion methods, the Druid [MiddleManager](../design/middlemanager.md) processes or the [Indexer](../design/indexer.md) processes load your source data. One exception is
+Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes start and monitor Hadoop jobs. 

Review comment:
       ```suggestion
   Hadoop-based ingestion, which uses a Hadoop MapReduce job on YARN MiddleManager or Indexer processes to start and monitor Hadoop jobs. 
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.

Review comment:
       ```suggestion
   Druid stores data in datasources, which are similar to tables in a traditional relational database management system (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.

Review comment:
       So the user can use _either_ `timestampSpec` or `granularitySpec` as a primary timestamp but not both?
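
For what it's worth, the two specs do different jobs: `timestampSpec` selects and parses the input field that becomes the primary timestamp (`__time`), while `granularitySpec` controls the time-based operations (bucketing, query granularity, rollup) that act on that timestamp. A minimal, hedged `dataSchema` sketch (field names and values are illustrative, not from this PR):

```json
{
  "dataSchema": {
    "dataSource": "example",
    "timestampSpec": { "column": "ts", "format": "iso" },
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "day",
      "queryGranularity": "hour",
      "rollup": true
    }
  }
}
```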

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid having matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio

Review comment:
       ```suggestion
       - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio.
   ```
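
For context, the `cnt` in that query is just a count-type aggregator defined at ingestion time. A minimal sketch of what such a metric might look like in the `metricsSpec` (the name `cnt` is illustrative and assumed to match the query):

```json
"metricsSpec": [
  { "type": "count", "name": "cnt" }
]
```

With rollup enabled, `cnt` records how many input events were combined into each stored row, so `SUM("cnt") / COUNT(*)` approximates ingested events per stored row, i.e. the rollup ratio.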

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid having matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.

Review comment:
       ```suggestion
   ## Perfect rollup vs best-effort rollup
   
   Depending on the ingestion method, Druid has the following rollup options:
   - Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
   - _Best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
   ```
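
To make the rollup behavior concrete, here is a hypothetical example (data invented for illustration). With `queryGranularity` set to `hour`, these three input events share the same truncated timestamp and the same dimension value, so Druid stores them as a single row:

```json
[
  { "timestamp": "2021-08-03T21:02:00Z", "country": "US", "clicks": 2 },
  { "timestamp": "2021-08-03T21:15:00Z", "country": "US", "clicks": 1 },
  { "timestamp": "2021-08-03T21:40:00Z", "country": "US", "clicks": 5 }
]
```

Assuming a `count` metric named `cnt` and a `longSum` metric on `clicks`, the rolled-up row would look like:

```json
{ "__time": "2021-08-03T21:00:00Z", "country": "US", "cnt": 3, "clicks": 8 }
```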

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.

Review comment:
       ```suggestion
   This topic describes how to set up partitions within a single datasource. It does not cover how to use multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain even while you continuously add new data from a stream.
+
+The following table shows how each ingestion method handles partitioning:
+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.md)|Configured using [`partitionsSpec`](native-batch.md#partitionsspec) inside the `tuningConfig`.|
+|[Hadoop](hadoop.md)|Configured using [`partitionsSpec`](hadoop.md#partitionsspec) inside the `tuningConfig`.|
+|[Kafka indexing service](../development/extensions-core/kafka-ingestion.md)|Kafka topic partitioning defines how partitions the datasource. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|
+|[Kinesis indexing service](../development/extensions-core/kinesis-ingestion.md)|Kinesis stream sharding defines how partitions the datasource.. You can also [reindex](data-management.md#reingesting-data) or [compact](compaction.md) to repartition after initial ingestion.|

Review comment:
       "defines how partitions the datasource" sounds unclear

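Since the table points readers at `partitionsSpec`, a hedged fragment showing what secondary partitioning on a "natural" dimension might look like for native batch ingestion (the dimension name and row target are illustrative; this fragment goes inside the `tuningConfig`, and, if I recall correctly, native batch requires `forceGuaranteedRollup: true` for the `single_dim` and `hashed` types):

```json
"partitionsSpec": {
  "type": "single_dim",
  "partitionDimension": "country",
  "targetRowsPerSegment": 5000000
}
```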
##########
File path: docs/querying/multi-value-dimensions.md
##########
@@ -1,4 +1,4 @@
----
+  ---

Review comment:
       ```suggestion
   ---
   ```

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.

Review comment:
       phrasing seems a bit strange; another option could be "partitioning of and sorting segments" though not much better

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.

Review comment:
       ```suggestion
   Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade-off for the efficiency of rollup, you lose the ability to query individual events.
   ```

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad-hoc manner.
+
+If you disable [rollup](./rollup.md), then Druid treats the set of
+dimensions like a set of columns to ingest. The dimensions behave exactly as you would expect from any database that does not support a rollup feature.
+
+At ingestion time, you configure dimensions in the [`dimensionsSpec`](./ingestion-spec.md#dimensionsspec).
+
+## Metrics
+
+Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you Specify a metric, you can apply an aggregation function to each row during ingestion. This

Review comment:
       ```suggestion
   Metrics are columns that Druid stores in an aggregated form. Metrics are most useful when you enable [rollup](rollup.md). If you specify a metric, you can apply an aggregation function to each row during ingestion. This
   ```
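
A small illustrative fragment (column names invented here) showing how dimensions stored "as-is" sit next to aggregated metrics in an ingestion spec:

```json
"dimensionsSpec": {
  "dimensions": ["country", "page"]
},
"metricsSpec": [
  { "type": "count", "name": "cnt" },
  { "type": "longSum", "name": "clicks", "fieldName": "clicks" }
]
```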

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
+
+## How to configure partitioning
+
+Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. If you are doing initial ingestion through a less-flexible method like
+Kafka), you can use [reindexing](data-management.md#reingesting-data) or [compaction](compaction.md) to repartition your data after initial ingestion. This is a powerful technique you can use to optimally partition any data older than a certain even while you continuously add new data from a stream.

Review comment:
       "older than a certain **event**" or "older than a certain **X**, even while" ?

##########
File path: docs/ingestion/partitioning.md
##########
@@ -0,0 +1,50 @@
+---
+id: partitioning
+title: Partitioning
+sidebar_label: Partitioning
+description: Describes time chunk and secondary partitioning in Druid. Provides guidance to choose a secondary partition dimension.
+---
+
+Optimal partitioning and sorting of segments within your Druid datasources can have substantial impact on footprint and performance.
+
+One way to partition is to your load data into separate datasources. This is a perfectly viable approach that works very well when the number of datasources does not lead to excessive per-datasource overheads. 
+
+This topic describes how to set up partitions within a single datasource. It does not cover using multiple datasources. See [Multitenancy considerations](../querying/multitenancy.md) for more details on splitting data into separate datasources and potential operational considerations.
+
+## Time chunk partitioning
+
+Druid always partitions datasources by time into _time chunks_. Each time chunk contains one or more segments. This partitioning happens for all ingestion methods based on the `segmentGranularity` parameter in your ingestion spec `dataSchema` object.
+
+## Secondary partitioning
+
+Druid can partition segments within a particular time chunk further depending upon options that vary based on the ingestion type you have chosen. In general, secondary partitioning on a particular dimension improves locality. This means that rows with the same value for that dimension are stored together, decreasing access time.
+
+To achieve the best performance and smallest overall footprint, partition your data on a "natural"
+dimension that you often use as a filter when possible. Such partitioning often improves compression and query performance. For example, some cases have yielded threefold storage size decreases.
+
+## Partitioning and sorting
+
+Partitioning and sorting work well together. If you do have a "natural" partitioning dimension, consider placing it first in the `dimensions` list of your `dimensionsSpec`. This way Druid sorts rows within each segment by that column. This sorting configuration frequently improves compression more than using partitioning alone.
+
+> Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, and letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.

Review comment:
       ```suggestion
   > Note that Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your `dimensionsSpec`. This sorting can preclude the efficacy of dimension sorting. To work around this limitation if necessary, set your `queryGranularity` equal to `segmentGranularity` in your [`granularitySpec`](./ingestion-spec.md#granularityspec). Druid will set all timestamps within the segment to the same value, letting you identify a [secondary timestamp](schema-design.md#secondary-timestamps) as the "real" timestamp.
   ```
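
For readers trying the workaround in that note, a hedged sketch of the `granularitySpec` it implies (the granularities are illustrative):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "day"
}
```

With both granularities equal, every row in a segment carries the same truncated `__time`, so the sort order within the segment is effectively decided by the first dimension in `dimensionsSpec`.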

##########
File path: docs/ingestion/data-model.md
##########
@@ -0,0 +1,38 @@
+---
+id: data-model
+title: "Druid data model"
+sidebar_label: Data model
+description: Introduces concepts of datasources, primary timestamp, dimensions, and metrics.
+---
+
+Druid stores data in datasources, which are similar to tables in a traditional relational database management systems (RDBMS). Druid's data model shares  similarities with both relational and timeseries data models.
+
+## Primary timestamp
+
+Druid schemas must always include a primary timestamp. Druid uses the primary timestamp to [partition and sort](./partitioning.md) your data. Druid uses the primary timestamp to rapidly identify and retrieve data within the time range of queries. Druid also uses the primary timestamp column
+for time-based [data management operations](./data-management.md) such as dropping time chunks, overwriting time chunks, and time-based retention rules.
+
+Druid parses the primary timestamp based on the [`timestampSpec`](./ingestion-spec.md#timestampspec) configuration at ingestion time. You can control other important operations that are based on the primary timestamp
+[`granularitySpec`](./ingestion-spec.md#granularityspec). Regardless of the source input field for the primary timestamp, Druid always stores the timestamp in the `__time` column in your Druid datasource.
+
+If you have more than one timestamp column, you can store the others as
+[secondary timestamps](./schema-design.md#secondary-timestamps).
+
+## Dimensions
+
+Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad-hoc manner.

Review comment:
       ```suggestion
   Dimensions are columns that Druid stores "as-is". You can use dimensions for any purpose. For example, you can group, filter, or apply aggregators to dimensions at query time in an ad hoc manner.
   ```

##########
File path: docs/ingestion/rollup.md
##########
@@ -0,0 +1,61 @@
+---
+id: rollup
+title: "Data rollup"
+sidebar_label: Data rollup
+description: Introduces rollup as a concept. Provides suggestions to maximize the benefits of rollup. Differentiates between perfect and best-effort rollup.
+---
+Druid can roll up data at ingestion time to reduce the amount of raw data to  store on disk. Rollup is a form of summarization or pre-aggregation. Rolling up data can dramatically reduce the size of data to be stored and reduce row counts by potentially orders of magnitude. As a trade off for the efficiency of rollup, you lose the ability to query individual events.
+
+At ingestion time, you control rollup with the `rollup` setting in the [`granularitySpec`](./ingestion-spec.md#granularityspec). Rollup is enabled by default. This means Druid combines into a single row any rows that have identical [dimension](./data-model.md#dimensions) values and [timestamp](./data-model.md#primary-timestamp) values after [`queryGranularity`-based truncation](./ingestion-spec.md#granularityspec).
+
+When you disable rollup, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to databases that do not support a rollup feature. Set `rollup` to `false` if you want Druid to store each record as-is, without any rollup summarization.
+
+## Maximizing rollup ratio
+
+To measure the rollup ratio of a datasource, compare the number of rows in Druid with the number of ingested events. The higher this result, the more benefit you are gaining from rollup. For example you can run the following [Druid SQL](../querying/sql.md) query after ingestion:
+
+```sql
+SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource
+```
+
+In this query, `cnt` refers to a "count" type metric from your ingestion spec. See
+[Counting the number of ingested events](schema-design.md#counting) on the "Schema design" page for more details about how counting works when rollup is enabled.
+
+Tips for maximizing rollup:
+
+- Design your schema with fewer dimensions and lower cardinality dimensions to yield better rollup ratios.
+- Use [sketches](schema-design.md#sketches) to avoid storing high cardinality dimensions, which decrease rollup ratios.
+- Adjust your `queryGranularity` at ingestion time to increase the chances that multiple rows in Druid having matching timestamps. For example, use five minute query granularity (`PT5M`) instead of one minute (`PT1M`).
+- You can optionally load the same data into more than one Druid datasource. For example:
+    - Create a "full" datasource that has rollup disabled, or enabled, but with a minimal rollup ratio
+    - Create a second "abbreviated" datasource with fewer dimensions and a higher rollup ratio.
+     When queries only involve dimensions in the "abbreviated" set, use the second datasource to reduce query times. Often, this method only requires a small increase in storage footprint because abbreviated datasources tend to be substantially smaller.
+- If you use a [best-effort rollup](#perfect-rollup-vs-best-effort-rollup) ingestion configuration that does not guarantee perfect rollup, try one of the following:
+    - Switch to a guaranteed perfect rollup option.
+    - [Reindex](data-management.md#reingesting-data) or [compact](compaction.md) your data in the background after initial ingestion.
+
+## Perfect rollup vs Best-effort rollup
+
+Depending on the ingestion method, Druid has the following rollup options:
+- Guaranteed _perfect rollup_: Druid perfectly aggregates input data at ingestion time.
+- _best-effort rollup_: Druid may not perfectly aggregate input data. Therefore, multiple segments might contain rows with the same timestamp and dimension values.
+
+In general, ingestion methods that offer best-effort rollup do this for one of the following reasons:
+- The ingestion method parallelizes ingestion without a shuffling step required for perfect rollup.
+- The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received,

Review comment:
       ```suggestion
   - The ingestion method uses _incremental publishing_ which means it finalizes and publishes segments before all data for a time chunk has been received.
   ```
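
For the "switch to a guaranteed perfect rollup option" tip quoted above, a hedged sketch of what that switch might look like in a native batch `tuningConfig` (the shard count is illustrative):

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": { "type": "hashed", "numShards": 4 }
}
```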




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


