You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/12 02:48:06 UTC

[GitHub] [druid] gianm opened a new pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

gianm opened a new pull request #10267:
URL: https://github.com/apache/druid/pull/10267


   DruidInputSource, DruidSegmentReader changes:
   
   1) Remove "dimensions" and "metrics". They are not necessary, because we
      can compute which columns we need to read based on what is going to
      be used by the timestamp, transform, dimensions, and metrics.
   2) Start using ColumnsFilter (see below) to decide which columns we need
      to read.
   3) Actually respect the "timestampSpec". Previously, it was ignored, and
      the timestamp of the returned InputRows was set to the `__time` column
      of the input datasource.
   
   (1) and (2) together fix a bug in which the DruidInputSource would not
   properly read columns that are used as inputs to a transformSpec.
   
   (3) fixes a bug where the timestampSpec would be ignored if you attempted
   to set the column to something other than `__time`.
   
   (1) and (3) are breaking changes.
   
   Web console changes:
   
   1) Remove "Dimensions" and "Metrics" from the Druid input source.
   2) Set timestampSpec to `{"column": "__time", "format": "millis"}` for
      compatibility with the new behavior.
   
   Other changes:
   
   1) Add ColumnsFilter, a new class that allows input readers to determine
      which columns they need to read. Currently, it's only used by the
      DruidInputSource, but it could be used by other columnar input sources
      in the future.
   2) Add a ColumnsFilter to InputRowSchema.
   3) Remove the metric names from InputRowSchema (they were unused).
   4) Add InputRowSchemas.fromDataSchema method that computes the proper
      ColumnsFilter for given timestamp, dimensions, transform, and metrics.
   5) Add "getRequiredColumns" method to TransformSpec to support the above.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600972332



##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.

Review comment:
       Good idea. I added this.
   
   ```
   Alternatively, if your goals can be satisfied by [compaction](compaction.md), consider that instead as a simpler
   approach.
   ```

##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.

Review comment:
       Good idea. I added this.
   
   ```
   Alternatively, if your goals can be satisfied by [compaction](compaction.md),
   consider that instead as a simpler approach.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-807149188


   thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600962968



##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
+
+An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new
+rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and "page".
 
 ```json
-...
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia_rollup",
+      "timestampSpec": {
+        "column": "__time",
+        "format": "millis"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "countryName",
+          "page"
+        ]
+      },
+      "metricsSpec": [
+        {
+          "type": "count",
+          "name": "cnt"
+        }
+      ],
+      "granularitySpec": {
+        "type": "uniform",
+        "queryGranularity": "HOUR",
+        "segmentGranularity": "DAY",
+        "intervals": ["2016-06-27/P1D"],
+        "rollup": true
+      }
+    },
     "ioConfig": {
       "type": "index_parallel",
       "inputSource": {
         "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02",
-        "dimensions": [
-          "page",
-          "user"
-        ],
-        "metrics": [
-          "added"
-        ],
-        "filter": {
-          "type": "selector",
-          "dimension": "page",
-          "value": "Druid"
-        }
+        "dataSource": "wikipedia_raw",
+        "interval": "2016-06-27/P1D"
       }
-      ...
     },
-...
+    "tuningConfig": {
+      "type": "index_parallel",
+      "partitionsSpec": {
+        "type": "hashed",
+        "numShards": 1
+      },
+      "forceGuaranteedRollup": true,
+      "maxNumConcurrentSubTasks": 1

Review comment:
       What I was thinking was:
   
   - I want to include a full ingest spec, not just the inputSource part, so people have a full example.
   - This spec uses rollup, so for a reindexing spec, it'd be good to use a partitionsSpec that guarantees rollup too.
   
   Do you have a better suggestion for what to put in the example?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] vogievetsky commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

vogievetsky commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-740308404


   The web console changes (*.tsx) look good to me


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r470879141



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/InputRowSchemas.java
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.input;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.druid.data.input.ColumnsFilter;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.impl.DimensionsSpec;
+import org.apache.druid.data.input.impl.TimestampSpec;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import org.apache.druid.segment.indexing.DataSchema;
+import org.apache.druid.segment.transform.Transform;
+import org.apache.druid.segment.transform.TransformSpec;
+
+import java.util.HashSet;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Utilities that are helpful when implementing {@link org.apache.druid.data.input.InputEntityReader}.
+ */
+public class InputRowSchemas
+{
+  private InputRowSchemas()
+  {
+    // No instantiation.
+  }
+
+  /**
+   * Creates an {@link InputRowSchema} from a given {@link DataSchema}.
+   */
+  public static InputRowSchema fromDataSchema(final DataSchema dataSchema)
+  {
+    return new InputRowSchema(
+        dataSchema.getTimestampSpec(),
+        dataSchema.getDimensionsSpec(),
+        createColumnsFilter(
+            dataSchema.getTimestampSpec(),
+            dataSchema.getDimensionsSpec(),
+            dataSchema.getTransformSpec(),
+            dataSchema.getAggregators()
+        )
+    );
+  }
+
+  /**
+   * Build a {@link ColumnsFilter} that can filter down the list of columns that must be read after flattening.
+   *
+   * @see InputRowSchema#getColumnsFilter()
+   */
+  @VisibleForTesting
+  static ColumnsFilter createColumnsFilter(

Review comment:
       thought it was a good method 👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-697687585


   I'm having a tough time figuring out why the integration tests aren't passing. I suppose it's related to the fact that I added testing there for the new `druid.indexer.task.ignoreTimestampSpecForDruidInputSource` property, but, I built a copy locally and tested that property and it works OK. So there might be something else going on.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm edited a comment on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm edited a comment on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-682031792


   > The new config LGTM, there are some CI failures
   
   @jon-wei thanks; I can't see the CI failures right now due to merge conflicts, so I just fixed them and pushed them up. I'll take another look once CI runs again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-682031792


   > The new config LGTM, there are some CI failures
   
   @jon-wei thanks; I can't see the CI failures right now due to merge conflicts, so I just fixed them and pushed them up. I'll take another look in a bit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-672529735


   Btw, this fixes #10266 too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600981166



##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
+
+An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new
+rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and "page".
 
 ```json
-...
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia_rollup",
+      "timestampSpec": {
+        "column": "__time",
+        "format": "millis"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "countryName",
+          "page"
+        ]
+      },
+      "metricsSpec": [
+        {
+          "type": "count",
+          "name": "cnt"
+        }
+      ],
+      "granularitySpec": {
+        "type": "uniform",
+        "queryGranularity": "HOUR",
+        "segmentGranularity": "DAY",
+        "intervals": ["2016-06-27/P1D"],
+        "rollup": true
+      }
+    },
     "ioConfig": {
       "type": "index_parallel",
       "inputSource": {
         "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02",
-        "dimensions": [
-          "page",
-          "user"
-        ],
-        "metrics": [
-          "added"
-        ],
-        "filter": {
-          "type": "selector",
-          "dimension": "page",
-          "value": "Druid"
-        }
+        "dataSource": "wikipedia_raw",
+        "interval": "2016-06-27/P1D"
       }
-      ...
     },
-...
+    "tuningConfig": {
+      "type": "index_parallel",
+      "partitionsSpec": {
+        "type": "hashed",
+        "numShards": 1
+      },
+      "forceGuaranteedRollup": true,
+      "maxNumConcurrentSubTasks": 1

Review comment:
       OK, I'll do that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600724563



##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.

Review comment:
       Maybe good to suggest using auto compaction here instead of writing an ingestion spec.

##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/DruidSegmentReader.java
##########
@@ -122,8 +147,16 @@
   @Override
   protected List<InputRow> parseInputRows(Map<String, Object> intermediateRow) throws ParseException
   {
-    final DateTime timestamp = (DateTime) intermediateRow.get(ColumnHolder.TIME_COLUMN_NAME);
-    return Collections.singletonList(new MapBasedInputRow(timestamp.getMillis(), dimensions, intermediateRow));
+    return Collections.singletonList(
+        MapInputRowParser.parse(
+            new InputRowSchema(

Review comment:
       Would be better to cache `inputRowSchema` since this function is called per row.

##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
+
+An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new
+rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and "page".
 
 ```json
-...
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia_rollup",
+      "timestampSpec": {
+        "column": "__time",
+        "format": "millis"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "countryName",
+          "page"
+        ]
+      },
+      "metricsSpec": [
+        {
+          "type": "count",
+          "name": "cnt"
+        }
+      ],
+      "granularitySpec": {
+        "type": "uniform",
+        "queryGranularity": "HOUR",
+        "segmentGranularity": "DAY",
+        "intervals": ["2016-06-27/P1D"],
+        "rollup": true
+      }
+    },
     "ioConfig": {
       "type": "index_parallel",
       "inputSource": {
         "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02",
-        "dimensions": [
-          "page",
-          "user"
-        ],
-        "metrics": [
-          "added"
-        ],
-        "filter": {
-          "type": "selector",
-          "dimension": "page",
-          "value": "Druid"
-        }
+        "dataSource": "wikipedia_raw",
+        "interval": "2016-06-27/P1D"
       }
-      ...
     },
-...
+    "tuningConfig": {
+      "type": "index_parallel",
+      "partitionsSpec": {
+        "type": "hashed",
+        "numShards": 1
+      },
+      "forceGuaranteedRollup": true,
+      "maxNumConcurrentSubTasks": 1

Review comment:
       This part was not in the previous example. Was it intentional to use `hashed` partitionsSpec here? Seems unnecessary to me.

##########
File path: integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java
##########
@@ -129,7 +129,7 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception
           fullDatasourceName,
           AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING,
           0,
-          22482,
+          22481,

Review comment:
       Do you know why this changed?

##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java
##########
@@ -160,6 +184,7 @@ public String getDataSource()
 
   @Nullable
   @JsonProperty
+  @JsonInclude(Include.NON_NULL)

Review comment:
       Is it better to add this annotation at the class-level? Seems reasonable to not include any fields in JSON if they are null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-756633682


   Fixed up some merge conflicts.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-697969629


   @gianm I saw some messages like this in the perfect rollup test: https://travis-ci.org/github/apache/druid/jobs/729522884
   
   > {"host":"172.172.172.7","port":8102,"tlsPort":8103},"dataSource":"wikipedia_parallel_druid_input_source_index_test Россия 한국 中国!?","errorMsg":"java.util.concurrent.ExecutionException: java.lang.ClassCastException: org.apache.druid.segment.incr..."},{"id":"partial_index_generate_wikipedia_parallel_druid_input_source_index_test Россия 한국 中国!?_bajijpif_2020-09-23T07:54:13.384Z","groupId":"index_parallel_wikipedia_parallel_druid_input_source_index_test Россия 한국 中国!?_pgofimno_2020-09-23T07:52:11.488Z","type":"partial_index_generate","createdTime":"2020-09-23T07:54:13.453Z","queueInsertionTime":"1970-01-01T00:00:00.000Z","statusCode":"FAILED","status":"FAILED","runnerStatusCode":"NONE","duration":55228,"location"
   
   I didn't see the ClassCastException in the log for the failed batch test though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm merged pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm merged pull request #10267:
URL: https://github.com/apache/druid/pull/10267


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600967002



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java
##########
@@ -129,7 +129,7 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception
           fullDatasourceName,
           AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING,
           0,
-          22482,
+          22481,

Review comment:
       I guessed it was because the metadata changed. By that, I mean the `org.apache.druid.segment.Metadata` object stored in the segment, which contains the TimestampSpec.
   
   It adds up, I think, since -1 character is the difference between the old default `timestamp` + `auto` (13 chars) and the new default `__time` + `millis` (12 chars).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r537984770



##########
File path: docs/configuration/index.md
##########
@@ -1249,6 +1249,7 @@ Additional peon configs include:
 |`druid.indexer.task.gracefulShutdownTimeout`|Wait this long on middleManager restart for restorable tasks to gracefully exit.|PT5M|
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
+|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch.md#druid-input-source) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.20.0.|false|

Review comment:
       I'll change it to 0.21.0, in the guess that this will be the next release.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r469276075



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java
##########
@@ -87,13 +91,21 @@
   @Nullable
   private final List<WindowedSegmentId> segmentIds;
   private final DimFilter dimFilter;

Review comment:
       It's possible to specify a filter alongside transforms today! You can do it in two places:
   
   - In the `transformSpec` (this works with any input source / format, see https://druid.apache.org/docs/latest/ingestion/index.html#filter)
   - In the druid `inputSource` itself (of course, only works with this input source)
   
   It's a little silly to have both, perhaps, but there's a practical reason: specifying a filter in the druid `inputSource` is faster, because it is applied while creating the cursor that reads the data, and therefore it can use indexes, etc. The filter in the `transformSpec` is applied _after_ the cursor generates rows.
   
   But I think in the future, it'd be better to support pushing down the `transformSpec` filter into the cursor, and then we could deprecate the filter parameter in the `inputSource`, because it wouldn't be useful anymore.
   
   For now, I suggest we leave it as-is.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600964299



##########
File path: docs/ingestion/native-batch.md
##########
@@ -1282,60 +1282,81 @@ no `inputFormat` field needs to be specified in the ingestion spec when using th
 |type|This should be "druid".|yes|
 |dataSource|A String defining the Druid datasource to fetch rows from|yes|
 |interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
-|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned. |no|
-|metrics|The list of Strings containing the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|
 |filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
 
-A minimal example DruidInputSource spec is shown below:
+The Druid input source can be used for a variety of purposes, including:
 
-```json
-...
-    "ioConfig": {
-      "type": "index_parallel",
-      "inputSource": {
-        "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02"
-      }
-      ...
-    },
-...
-```
+- Creating new datasources that are rolled-up copies of existing datasources.
+- Changing the [partitioning or sorting](index.md#partitioning) of a datasource to improve performance.
+- Updating or removing rows using a [`transformSpec`](index.md#transformspec).
 
-The spec above will read all existing dimension and metric columns from
-the `wikipedia` datasource, including all rows with a timestamp (the `__time` column)
-within the interval `2013-01-01/2013-01-02`.
+When using the Druid input source, the timestamp column shows up as a numeric field named `__time` set to the number
+of milliseconds since the epoch (January 1, 1970 00:00:00 UTC). It is common to use this in the timestampSpec, if you
+want the output timestamp to be equivalent to the input timestamp. In this case, set the timestamp column to `__time`
+and the format to `auto` or `millis`.
 
-A spec that applies a filter and reads a subset of the original datasource's columns is shown below.
+It is OK for the input and output datasources to be the same. In this case, newly generated data will overwrite the
+previous data for the intervals specified in the `granularitySpec`. Generally, if you are going to do this, it is a
+good idea to test out your reindexing by writing to a separate datasource before overwriting your main one.
+
+An example task spec is shown below. It reads from a hypothetical raw datasource `wikipedia_raw` and creates a new
+rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and "page".
 
 ```json
-...
+{
+  "type": "index_parallel",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "wikipedia_rollup",
+      "timestampSpec": {
+        "column": "__time",
+        "format": "millis"
+      },
+      "dimensionsSpec": {
+        "dimensions": [
+          "countryName",
+          "page"
+        ]
+      },
+      "metricsSpec": [
+        {
+          "type": "count",
+          "name": "cnt"
+        }
+      ],
+      "granularitySpec": {
+        "type": "uniform",
+        "queryGranularity": "HOUR",
+        "segmentGranularity": "DAY",
+        "intervals": ["2016-06-27/P1D"],
+        "rollup": true
+      }
+    },
     "ioConfig": {
       "type": "index_parallel",
       "inputSource": {
         "type": "druid",
-        "dataSource": "wikipedia",
-        "interval": "2013-01-01/2013-01-02",
-        "dimensions": [
-          "page",
-          "user"
-        ],
-        "metrics": [
-          "added"
-        ],
-        "filter": {
-          "type": "selector",
-          "dimension": "page",
-          "value": "Druid"
-        }
+        "dataSource": "wikipedia_raw",
+        "interval": "2016-06-27/P1D"
       }
-      ...
     },
-...
+    "tuningConfig": {
+      "type": "index_parallel",
+      "partitionsSpec": {
+        "type": "hashed",
+        "numShards": 1
+      },
+      "forceGuaranteedRollup": true,
+      "maxNumConcurrentSubTasks": 1

Review comment:
       I see. It makes sense. If that's the case, I would suggest simply removing `numShards` from the spec. The parallel task will find the `numShards` automatically based on `targetRowsPerSegment` which is 5 million by default.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-769974098


   Fixed a conflict and updated the docs to say "before Druid 0.22.0" instead of "before Druid 0.21.0".


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] vogievetsky commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

vogievetsky commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r537971385



##########
File path: docs/configuration/index.md
##########
@@ -1313,6 +1314,7 @@ then the value from the configuration below is used:
 |`druid.indexer.task.gracefulShutdownTimeout`|Wait this long on Indexer restart for restorable tasks to gracefully exit.|PT5M|
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
+|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch.md#druid-input-source) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.20.0.|false|

Review comment:
       ditto

##########
File path: docs/configuration/index.md
##########
@@ -1249,6 +1249,7 @@ Additional peon configs include:
 |`druid.indexer.task.gracefulShutdownTimeout`|Wait this long on middleManager restart for restorable tasks to gracefully exit.|PT5M|
 |`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|`/tmp/druid-indexing`|
 |`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
+|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks using the [Druid input source](../ingestion/native-batch.md#druid-input-source) will ignore the provided timestampSpec, and will use the `__time` column of the input datasource. This option is provided for compatibility with ingestion specs written before Druid 0.20.0.|false|

Review comment:
       should be 0.20.1 at this point




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r470653300



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/InputRowSchemas.java
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.input;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.druid.data.input.ColumnsFilter;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.impl.DimensionsSpec;
+import org.apache.druid.data.input.impl.TimestampSpec;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import org.apache.druid.segment.indexing.DataSchema;
+import org.apache.druid.segment.transform.Transform;
+import org.apache.druid.segment.transform.TransformSpec;
+
+import java.util.HashSet;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Utilities that are helpful when implementing {@link org.apache.druid.data.input.InputEntityReader}.
+ */
+public class InputRowSchemas
+{
+  private InputRowSchemas()
+  {
+    // No instantiation.
+  }
+
+  /**
+   * Creates an {@link InputRowSchema} from a given {@link DataSchema}.
+   */
+  public static InputRowSchema fromDataSchema(final DataSchema dataSchema)
+  {
+    return new InputRowSchema(
+        dataSchema.getTimestampSpec(),
+        dataSchema.getDimensionsSpec(),
+        createColumnsFilter(
+            dataSchema.getTimestampSpec(),
+            dataSchema.getDimensionsSpec(),
+            dataSchema.getTransformSpec(),
+            dataSchema.getAggregators()
+        )
+    );
+  }
+
+  /**
+   * Build a {@link ColumnsFilter} that can filter down the list of columns that must be read after flattening.
+   *
+   * @see InputRowSchema#getColumnsFilter()
+   */
+  @VisibleForTesting
+  static ColumnsFilter createColumnsFilter(

Review comment:
       (What's the thumbs up for?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-674098449

> (1) sounds good to me.

Hmm, thinking about this some more, it may be best to _not_ have it be an error. The risk of doing a sanity-check error is that something that worked before will stop working once you upgrade. It seems to me that any time someone had been specifying `dimensions` and `metrics` manually, but not including all input fields, they probably didn't do that intentionally. So this should be a bug fix that they appreciate, even though it's a behavioral change.

> LGTM, for the (3) backwards compatibility item, what makes sense to me if we decide to feature flag this (I'm not sure that's worth doing, maybe enough to just show a clear error message and call this out in the release notes), would be to have the flag control whether we ignore (old mode) or respect the timestampSpec, and in the new mode we could have that __time with auto/millis format check.

I think it makes sense to add a flag, because of the potential for confusion around why something suddenly broke in a subtle way after an upgrade. Usually (hopefully) cluster operators read the release notes, but not all users will read them, and the people submitting indexing tasks might not be the same people that operate the cluster.

In new mode, I suggest we skip the check, because that will enable the full power of timestampSpec to be used (you could use it to switch a secondary timestamp to primary, for example).

IMO the default should be new mode, but we should put something in the release notes that says if you have a bunch of users that might be relying on the old behavior, you can set this flag and get the old behavior back.

I'd also consider adding a logged warning if the timestampSpec is anything other than `__time` + `auto` / `millis`, just to let people know that if they didn't intend to be in a special case, they should change their timestampSpec.

What do you think?

Separately — as a reviewer — would you prefer these changes to be made in this patch, or in a follow-up? I'm OK either way…

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600979999



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java
##########
@@ -160,6 +184,7 @@ public String getDataSource()
 
   @Nullable
   @JsonProperty
+  @JsonInclude(Include.NON_NULL)

Review comment:
       Interesting question. To answer it, I had to add some tests to make sure it worked properly. The answer is yes, it does work. I'll make the change and keep the new tests (look for them in DruidInputSourceTest).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-673733953


   > For (1) we could throw an error in situations where people explicitly specify dimensions and metrics that don't match the computed inputs, informing them that these parameters no longer have an effect and asking them to remove them.
   
   (1) sounds good to me.
   
   > For (3) we could introduce a flag that causes an error to be thrown when you set the timestamp spec to anything other than {"column": "__time", "format": "auto"} or {"column": "__time", "format": "millis"}. Druid would invite you to set the flag, which would tell it "yes I really want the new, more correct behavior". That feature flag could be on or off by default.
   
   I think I'm a little unclear on (3), is the feature flag controlling whether to use the old/new behavior or is it just for whether the auto/millis check is executed? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-806292555


   Thanks for the review @jihoonson. I've pushed up the changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm edited a comment on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm edited a comment on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-679860018


   @jon-wei I've pushed a new commit.
   
   > > Hmm, thinking about this some more, it may be best to not have it be an error. The risk of doing a sanity-check error is that something that worked before will stop working once you upgrade. It seems to me that any time someone had been specifying dimensions and metrics manually, but not including all input fields, they probably didn't do that intentionally. So this should be a bug fix that they appreciate, even though it's a behavioral change.
   > 
   > That sounds fine to me too, I don't really have a strong preference on that.
   
   For this one, I left it as not-an-error.
   
   > Ah, I was thinking that the auto/millis check in the new mode would only apply if the column being referenced was `__time`, so that users could still use alternate timestamp columns.
   > 
   > The warning log sounds fine to me.
   
   For this one, I added a config `druid.indexer.task.ignoreTimestampSpecForDruidInputSource` that defaults off (false). If enabled, timestampSpecs are ignored with the 'druid' input source, and a warning is logged each time a reader is created.
   
   Either way, a warning is logged if you read from the `__time` column with a format other than `millis` or `auto`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] vogievetsky commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

vogievetsky commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-740939223


   thank you for updating the docs per my ask


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-738621133


   > I didn't see the ClassCastException in the log for the failed batch test though.
   
   I finally ran this locally and was able to extract the full error. I think it's happening because `__time` is getting added to the dimensions list of an input row. (This makes a string dimension named `__time` get created, which violates all sorts of assumptions.)
   
   We'll need to prevent that from happening somehow. I'll look into it.
   
   ```
   Caused by: java.lang.ClassCastException: org.apache.druid.segment.incremental.IncrementalIndexRowHolder cannot be cast to org.apache.druid.segment.DimensionSelector
           at org.apache.druid.segment.StringDimensionIndexer.convertUnsortedValuesToSorted(StringDimensionIndexer.java:793) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.incremental.IncrementalIndexRowIterator.lambda$makeRowPointer$0(IncrementalIndexRowIterator.java:78) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_232]
           at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[?:1.8.0_232]
           at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_232]
           at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_232]
           at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_232]
           at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:546) ~[?:1.8.0_232]
           at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260) ~[?:1.8.0_232]
           at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:505) ~[?:1.8.0_232]
           at org.apache.druid.segment.incremental.IncrementalIndexRowIterator.makeRowPointer(IncrementalIndexRowIterator.java:80) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.incremental.IncrementalIndexRowIterator.<init>(IncrementalIndexRowIterator.java:54) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.incremental.IncrementalIndexAdapter.getRows(IncrementalIndexAdapter.java:162) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.IndexMergerV9.makeMergedTimeAndDimsIterator(IndexMergerV9.java:1106) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.IndexMergerV9.makeIndexFiles(IndexMergerV9.java:255) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.IndexMergerV9.merge(IndexMergerV9.java:999) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.IndexMergerV9.persist(IndexMergerV9.java:864) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.IndexMergerV9.persist(IndexMergerV9.java:833) ~[druid-processing-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.persistHydrant(AppenderatorImpl.java:1334) ~[druid-server-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.access$100(AppenderatorImpl.java:102) ~[druid-server-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
           at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl$2.call(AppenderatorImpl.java:547) ~[druid-server-0.21.0-SNAPSHOT.jar:0.21.0-SNAPSHOT]
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-739745121


   OK, got it all sorted out. The tests are passing now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-674289603


   > Hmm, thinking about this some more, it may be best to not have it be an error. The risk of doing a sanity-check error is that something that worked before will stop working once you upgrade. It seems to me that any time someone had been specifying dimensions and metrics manually, but not including all input fields, they probably didn't do that intentionally. So this should be a bug fix that they appreciate, even though it's a behavioral change.
   
   That sounds fine to me too, I don't really have a strong preference on that.
   
   > In new mode, I suggest we skip the check, because that will enable the full power of timestampSpec to be used (you could use it to switch a secondary timestamp to primary, for example).
   
   > I'd also consider adding a logged warning if the timestampSpec is anything other than __time + auto / millis, just to let people know that if they didn't intend to be in a special case, they should change their timestampSpec.
   
   Ah, I was thinking that the auto/millis check in the new mode would only apply if the column being referenced was `__time`, so that users could still use alternate timestamp columns. 
   
   The warning log sounds fine to me.
   
   
   > Separately — as a reviewer — would you prefer these changes to be made in this patch, or in a follow-up? I'm OK either way…
   
   Let's do it in this patch, the additions I'm guessing won't be too large and we can have everything related in one place.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600967002



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java
##########
@@ -129,7 +129,7 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception
           fullDatasourceName,
           AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING,
           0,
-          22482,
+          22481,

Review comment:
       I guessed it was because the metadata changed.
   
   It adds up, I think, since -1 character is the difference between the old default `timestamp` + `auto` (13 chars) and the new default `__time` + `millis` (12 chars).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r470372282



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/InputRowSchemas.java
##########
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.input;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.druid.data.input.ColumnsFilter;
+import org.apache.druid.data.input.InputRowSchema;
+import org.apache.druid.data.input.impl.DimensionsSpec;
+import org.apache.druid.data.input.impl.TimestampSpec;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import org.apache.druid.segment.indexing.DataSchema;
+import org.apache.druid.segment.transform.Transform;
+import org.apache.druid.segment.transform.TransformSpec;
+
+import java.util.HashSet;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Utilities that are helpful when implementing {@link org.apache.druid.data.input.InputEntityReader}.
+ */
+public class InputRowSchemas
+{
+  private InputRowSchemas()
+  {
+    // No instantiation.
+  }
+
+  /**
+   * Creates an {@link InputRowSchema} from a given {@link DataSchema}.
+   */
+  public static InputRowSchema fromDataSchema(final DataSchema dataSchema)
+  {
+    return new InputRowSchema(
+        dataSchema.getTimestampSpec(),
+        dataSchema.getDimensionsSpec(),
+        createColumnsFilter(
+            dataSchema.getTimestampSpec(),
+            dataSchema.getDimensionsSpec(),
+            dataSchema.getTransformSpec(),
+            dataSchema.getAggregators()
+        )
+    );
+  }
+
+  /**
+   * Build a {@link ColumnsFilter} that can filter down the list of columns that must be read after flattening.
+   *
+   * @see InputRowSchema#getColumnsFilter()
+   */
+  @VisibleForTesting
+  static ColumnsFilter createColumnsFilter(

Review comment:
       👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r469190573



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java
##########
@@ -87,13 +91,21 @@
   @Nullable
   private final List<WindowedSegmentId> segmentIds;
   private final DimFilter dimFilter;

Review comment:
       will it make sense to move DimFilter outside the InputSource in the task json? It seems more natural to me to put the filters alongside transforms, dimensions, and metrics and leave only the data source properties inside the `InputSource` section. On the flip side, it could make the compatibility situation more complicated than it is. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jon-wei commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jon-wei commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-680400922


   The new config LGTM, there are some CI failures 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-738650486


   > We'll need to prevent that from happening somehow. I'll look into it.
   
   I pushed up a new commit that addresses this case by adding `__time` to the list of dimension exclusions. The relevant part of the patch is here: https://github.com/apache/druid/pull/10267/commits/be8a38950524036e25013e52088405fd3eaeb58e#diff-a0d939956afdb61e5b14e8feba35e6abf277f7dbb1ed5c897a8cf689767ea572R156-R193


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-756633682


   Fixed up some merge conflicts.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] jihoonson commented on a change in pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

jihoonson commented on a change in pull request #10267:
URL: https://github.com/apache/druid/pull/10267#discussion_r600988766



##########
File path: integration-tests/src/test/java/org/apache/druid/tests/coordinator/duty/ITAutoCompactionTest.java
##########
@@ -129,7 +129,7 @@ public void testAutoCompactionDutySubmitAndVerifyCompaction() throws Exception
           fullDatasourceName,
           AutoCompactionSnapshot.AutoCompactionScheduleStatus.RUNNING,
           0,
-          22482,
+          22481,

Review comment:
       Ah, that seems likely the reason. Thanks :+1: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-672522776

> (1) and (3) are breaking changes.

Fwiw, on this one, I think the likelihood of (1) causing problems is low. It's a breaking change because if you were previously specifying a column as an input to one of your dimensionsSpec or aggregators, but then explicitly _not_ including in the input source's "dimensions" or "metrics" list, it'll now actually get read. Previously it'd be treated as null. The new behavior is better & less brittle but is different.

(3) is still fairly low, but somewhat more likely. It's possible that someone had their timestampSpec set to something like `{"column": "__time", "format": "iso"}`, which is "wrong" but would have worked previously. Now it won't work: the timestamp will fail to parse (as it should, because it's not really in iso format).

This patch as written just lets these things break, but we could cushion the fall, potentially:

- For (1) we could throw an error in situations where people explicitly specify `dimensions` and `metrics` that don't match the computed inputs, informing them that these parameters no longer have an effect and asking them to remove them.
- For (3) we could introduce a flag that causes an error to be thrown when you set the timestamp spec to anything other than `{"column": "__time", "format": "auto"}` or `{"column": "__time", "format": "millis"}`. Druid would invite you to set the flag, which would tell it "yes I really want the new, more correct behavior". That feature flag could be on or off by default.

What do people think?

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] gianm commented on pull request #10267: DruidInputSource: Fix issues in column projection, timestamp handling.

Posted by GitBox <gi...@apache.org>.

gianm commented on pull request #10267:
URL: https://github.com/apache/druid/pull/10267#issuecomment-679860018


   @jon-wei I've pushed a new commit.
   
   > That sounds fine to me too, I don't really have a strong preference on that.
   
   > > Hmm, thinking about this some more, it may be best to not have it be an error. The risk of doing a sanity-check error is that something that worked before will stop working once you upgrade. It seems to me that any time someone had been specifying dimensions and metrics manually, but not including all input fields, they probably didn't do that intentionally. So this should be a bug fix that they appreciate, even though it's a behavioral change.
   > 
   > That sounds fine to me too, I don't really have a strong preference on that.
   
   For this one, I left it as not-an-error.
   
   > Ah, I was thinking that the auto/millis check in the new mode would only apply if the column being referenced was `__time`, so that users could still use alternate timestamp columns.
   > 
   > The warning log sounds fine to me.
   
   For this one, I added a config `druid.indexer.task.ignoreTimestampSpecForDruidInputSource` that defaults off (false). If enabled, timestampSpecs are ignored with the 'druid' input source, and a warning is logged each time a reader is created.
   
   Either way, a warning is logged if you read from the `__time` column with a format other than `millis` or `auto`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org