You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/18 17:41:11 UTC

[GitHub] [druid] rohangarg opened a new pull request #11949: Specify time column for first/last aggregators

rohangarg opened a new pull request #11949:
URL: https://github.com/apache/druid/pull/11949


   Add the ability to pass time column in first/last aggregator (and latest/earliest SQL functions). It is to support cases where the time to query upon is stored as a part of a column different than `__time`. Also, some other logical time column can be specified.
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] cheddar commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

cheddar commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r752850395



##########
File path: docs/querying/sql.md
##########
@@ -353,9 +353,13 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "earliest" is the value first encountered with the minimum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`EARLIEST(expr, timeColumn)`|Returns the earliest value of `expr`, which must be numeric. Earliest value is defined as the first last encountered with the minimum overall value of time column of all values being aggregated.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|

Review comment:
       > defined as the first last
   
   Well, which is it!?  I think you mean "value" instead of "last"

##########
File path: processing/src/test/java/org/apache/druid/query/aggregation/first/FloatFirstAggregationTest.java
##########
@@ -91,7 +95,25 @@ public void testDoubleFirstAggregator()
   }
 
   @Test
-  public void testDoubleFirstBufferAggregator()
+  public void testFloatFirstAggregatorWithTimeColumn()
+  {
+    Aggregator agg = new FloatFirstAggregatorFactory("billy", "nilly", "customTime").factorize(colSelectorFactory);
+
+    aggregate(agg);

Review comment:
       at 4 calls, a loop would save 1 line.  Who knows which is actually more better.  There is a simplicity to just having the same thing written 4 times.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r754998897



##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
##########
@@ -206,19 +195,74 @@ public Aggregation toDruidAggregation(
       );
     }
 
+    final String fieldName = getColumnName(plannerContext, virtualColumnRegistry, args.get(0), rexNodes.get(0));
+
+    final AggregatorFactory theAggFactory;
+    switch (args.size()) {
+      case 1:
+        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, -1);
+        break;
+      case 2:
+        if (outputType.anyOf(ValueType.STRING, ValueType.COMPLEX)) {

Review comment:
       the if check you mention seems better to me - changed to that. also, inlined the signature in the if-else block for better readability




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] clintropolis commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

clintropolis commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r752856348



##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
##########
@@ -262,6 +306,17 @@ public RelDataType inferReturnType(SqlOperatorBinding sqlOperatorBinding)
                   "'" + aggregatorType.name() + "(expr, maxBytesPerString)'\n",
                   OperandTypes.ANY,
                   OperandTypes.and(OperandTypes.NUMERIC, OperandTypes.LITERAL)
+              ),
+              OperandTypes.sequence(
+                  "'" + aggregatorType.name() + "(expr, timeColumn)'\n",
+                  OperandTypes.ANY,
+                  OperandTypes.ANY

Review comment:
       any seems a bit permissive, since i think this probably needs to be a long?

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
##########
@@ -206,19 +195,74 @@ public Aggregation toDruidAggregation(
       );
     }
 
+    final String fieldName = getColumnName(plannerContext, virtualColumnRegistry, args.get(0), rexNodes.get(0));
+
+    final AggregatorFactory theAggFactory;
+    switch (args.size()) {
+      case 1:
+        theAggFactory = aggregatorType.createAggregatorFactory(aggregatorName, fieldName, null, outputType, -1);
+        break;
+      case 2:
+        if (outputType.anyOf(ValueType.STRING, ValueType.COMPLEX)) {

Review comment:
       nit: might be worth a comment that this is to handle 2nd argument as either byte size or time column, depending on the input type to make it a bit easier to make sense of what's going on here.
   
   I think you could also potentially just make this check be `!outputType.isNumeric()` since those are fixed width and don't have a size limit

##########
File path: processing/src/main/java/org/apache/druid/query/aggregation/first/DoubleFirstAggregatorFactory.java
##########
@@ -256,20 +259,28 @@ public String getFieldName()
     return fieldName;
   }
 
+  @JsonProperty
+  public String getTimeColumn()
+  {
+    return timeColumn;
+  }
+
   @Override
   public List<String> requiredFields()
   {
-    return Arrays.asList(ColumnHolder.TIME_COLUMN_NAME, fieldName);
+    return Arrays.asList(timeColumn, fieldName);
   }
 
   @Override
   public byte[] getCacheKey()
   {
     byte[] fieldNameBytes = StringUtils.toUtf8(fieldName);
+    byte[] timeColumnBytes = StringUtils.toUtf8(timeColumn);

Review comment:
       nit: maybe a good opportunity to switch to `CacheKeyBuilder`? (same comment for all other aggs)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 commented on pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 commented on pull request #11949:
URL: https://github.com/apache/druid/pull/11949#issuecomment-978814700


   Merging since test failure is unrelated. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 merged pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 merged pull request #11949:
URL: https://github.com/apache/druid/pull/11949


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r754998229



##########
File path: processing/src/main/java/org/apache/druid/query/aggregation/first/DoubleFirstAggregatorFactory.java
##########
@@ -256,20 +259,28 @@ public String getFieldName()
     return fieldName;
   }
 
+  @JsonProperty
+  public String getTimeColumn()
+  {
+    return timeColumn;
+  }
+
   @Override
   public List<String> requiredFields()
   {
-    return Arrays.asList(ColumnHolder.TIME_COLUMN_NAME, fieldName);
+    return Arrays.asList(timeColumn, fieldName);
   }
 
   @Override
   public byte[] getCacheKey()
   {
     byte[] fieldNameBytes = StringUtils.toUtf8(fieldName);
+    byte[] timeColumnBytes = StringUtils.toUtf8(timeColumn);

Review comment:
       yup, moved the aggs to `CacheKeyBuilder`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r754997651



##########
File path: docs/querying/aggregations.md
##########
@@ -143,101 +143,111 @@ Note that queries with first/last aggregators on a segment created with rollup e
 
 #### `doubleFirst` aggregator
 
-`doubleFirst` computes the metric value with the minimum timestamp or 0 in default mode, or `null` in SQL-compatible mode if no row exists.
+`doubleFirst` computes the metric value with the minimum value for time column or 0 in default mode, or `null` in 

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r755005430



##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java
##########
@@ -262,6 +306,17 @@ public RelDataType inferReturnType(SqlOperatorBinding sqlOperatorBinding)
                   "'" + aggregatorType.name() + "(expr, maxBytesPerString)'\n",
                   OperandTypes.ANY,
                   OperandTypes.and(OperandTypes.NUMERIC, OperandTypes.LITERAL)
+              ),
+              OperandTypes.sequence(
+                  "'" + aggregatorType.name() + "(expr, timeColumn)'\n",
+                  OperandTypes.ANY,
+                  OperandTypes.ANY

Review comment:
       true, moved it to `NUMERIC` - should we make it to `EXACT_NUMERIC` or does this look ok? long type isn't there. there's `INTEGER` too but I'm not sure if its in a mathematical sense or not.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r754997859



##########
File path: processing/src/test/java/org/apache/druid/query/aggregation/first/FloatFirstAggregationTest.java
##########
@@ -91,7 +95,25 @@ public void testDoubleFirstAggregator()
   }
 
   @Test
-  public void testDoubleFirstBufferAggregator()
+  public void testFloatFirstAggregatorWithTimeColumn()
+  {
+    Aggregator agg = new FloatFirstAggregatorFactory("billy", "nilly", "customTime").factorize(colSelectorFactory);
+
+    aggregate(agg);

Review comment:
       keeping it as is for now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] rohangarg commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

rohangarg commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r755005791



##########
File path: docs/querying/sql.md
##########
@@ -353,9 +353,13 @@ Only the COUNT, ARRAY_AGG, and STRING_AGG aggregations can accept the DISTINCT k
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats extension](../development/extensions-core/stats.md) documentation for additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. If `expr` comes from a relation with a timestamp column (like a Druid datasource) then "earliest" is the value first encountered with the minimum overall timestamp of all values being aggregated. If `expr` does not come from a relation with a timestamp, then it is simply the first value encountered.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`EARLIEST(expr, timeColumn)`|Returns the earliest value of `expr`, which must be numeric. Earliest value is defined as the first last encountered with the minimum overall value of time column of all values being aggregated.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|

Review comment:
       you are right - fixed. copy-paste error!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] LakshSingla commented on a change in pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

LakshSingla commented on a change in pull request #11949:
URL: https://github.com/apache/druid/pull/11949#discussion_r752561204



##########
File path: processing/src/test/java/org/apache/druid/query/aggregation/first/FloatFirstAggregationTest.java
##########
@@ -91,7 +95,25 @@ public void testDoubleFirstAggregator()
   }
 
   @Test
-  public void testDoubleFirstBufferAggregator()
+  public void testFloatFirstAggregatorWithTimeColumn()
+  {
+    Aggregator agg = new FloatFirstAggregatorFactory("billy", "nilly", "customTime").factorize(colSelectorFactory);
+
+    aggregate(agg);

Review comment:
       nit: Should this be done in a loop? 

##########
File path: docs/querying/aggregations.md
##########
@@ -143,101 +143,111 @@ Note that queries with first/last aggregators on a segment created with rollup e
 
 #### `doubleFirst` aggregator
 
-`doubleFirst` computes the metric value with the minimum timestamp or 0 in default mode, or `null` in SQL-compatible mode if no row exists.
+`doubleFirst` computes the metric value with the minimum value for time column or 0 in default mode, or `null` in 

Review comment:
       nit: Unnecessary line break 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] abhishekagarwal87 closed pull request #11949: Specify time column for first/last aggregators

Posted by GitBox <gi...@apache.org>.

abhishekagarwal87 closed pull request #11949:
URL: https://github.com/apache/druid/pull/11949


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org