Posted to commits@druid.apache.org by "paul-rogers (via GitHub)" <gi...@apache.org> on 2023/03/02 02:31:19 UTC

[GitHub] [druid] paul-rogers commented on a diff in pull request #13793: Add validation for aggregations on __time

paul-rogers commented on code in PR #13793:
URL: https://github.com/apache/druid/pull/13793#discussion_r1122534301


##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -196,6 +197,16 @@ public Aggregation toDruidAggregation(
 
     final String fieldName = getColumnName(plannerContext, virtualColumnRegistry, args.get(0), rexNodes.get(0));
 
+    if (!rowSignature.contains(ColumnHolder.TIME_COLUMN_NAME) && (aggregatorType == AggregatorType.LATEST || aggregatorType == AggregatorType.EARLIEST)) {
+      plannerContext.setPlanningError("%s() aggregator depends on __time column, but the underlying datasource "
+                                      + "or extern function you are querying doesn't contain a __time column. "
+                                      + "Please use %s_BY() and specify the time column you want to use.",
+                                      aggregatorType.name(),
+                                      aggregatorType.name()
+      );
+      return null;
+    }

Review Comment:
   I get that the user *could* use the alternative. The goal here _should be_ that the user can use the actual features which Druid claims to support.
   
   That is, I'm asserting that the goal of this PR should not be to accept broken behavior and tell the user not to use the broken stuff, but to fix the behavior so that what should work _does_ work.
   
   In short, we don't want to wrap a bug in nicer error messages. We want to fix the bug. (IMHO.)



##########
extensions-core/multi-stage-query/src/test/java/org/apache/druid/msq/exec/MSQSelectTest.java:
##########
@@ -1363,6 +1363,39 @@ public void testGroupByArrayWithMultiValueMvToArray()
         .verifyResults();
   }
 
+  @Test
+  public void testTimeColumnAggregationFromExtern() throws IOException
+  {
+    final File toRead = MSQTestFileUtils.getResourceAsTemporaryFile(temporaryFolder, this, "/wikipedia-sampled.json");
+    final String toReadAsJson = queryFramework().queryJsonMapper().writeValueAsString(toRead.getAbsolutePath());
+
+    RowSignature rowSignature = RowSignature.builder()
+                                            .add("__time", ColumnType.LONG)
+                                            .add("cnt", ColumnType.LONG)
+                                            .build();
+
+    testSelectQuery()
+        .setSql("WITH\n"
+                + "kttm_data AS (\n"
+                + "SELECT * FROM TABLE(\n"
+                + "  EXTERN(\n"
+                + "    '{ \"files\": [" + toReadAsJson + "],\"type\":\"local\"}',\n"
+                + "    '{\"type\":\"json\"}',\n"
+                + "    '[{\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"page\", \"type\": \"string\"}, {\"name\": \"user\", \"type\": \"string\"}]'\n"
+                + "  )\n"
+                + "))\n"
+                + "\n"
+                + "SELECT\n"
+                + "  FLOOR(TIME_PARSE(\"timestamp\") TO MINUTE) AS __time,\n"
+                + "  LATEST(\"page\") AS \"page\"\n"
+                + "FROM kttm_data "
+                + "GROUP BY 1")
+        .setExpectedValidationErrorMatcher(
+            ThrowableMessageMatcher.hasMessage(CoreMatchers.containsString("LATEST() aggregator depends on __time column"))

Review Comment:
   Sorry, I didn't follow. Maybe my note was too terse?
   
   The query here should be legal. `page` is a `VARCHAR` (Druid `string`) column. Segments allow `LATEST(string)` (though not `LATEST` of numeric types). MSQ is the modern way to create rollup tables.
   
   Now, this query might fail because we haven't passed the context parameter asking for a rolled-up datasource. But in that case, aggregations are applied first and the results are written to the datasource as a detail table, so the above query should still work.
   
   The test, however, expects the query to fail. My comment is noting that this is not what should happen. (I get that it is what this PR _wants_ to happen.) What should happen is that this query is legal, and we emit segments based on the `finalizeAggregations` context parameter.
   
   So, the ask is simple: let's find out what it takes so that this query passes and (with the right context setting) emits an intermediate `COMPLEX<Pair<string, long>>` value into the output segments.
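   
   For concreteness, here is a sketch of the `LATEST_BY()` rewrite that the new error message points users to, assuming the same `kttm_data` CTE over `EXTERN` as in the test above (the input spec is elided here):
   
   ```sql
   -- Workaround suggested by the error message: name the time expression explicitly
   -- with LATEST_BY(expr, timeExpr) instead of relying on an implicit __time column.
   SELECT
     FLOOR(TIME_PARSE("timestamp") TO MINUTE) AS __time,
     LATEST_BY("page", TIME_PARSE("timestamp")) AS "page"
   FROM kttm_data
   GROUP BY 1
   ```
   
   My point is that the original `LATEST("page")` form should not require this rewrite.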



##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteJoinQueryTest.java:
##########
@@ -1447,12 +1447,24 @@ public void testInnerJoinQueryOfLookup(Map<String, Object> queryContext)
                 .build()
         ),
         ImmutableList.of(
-            new Object[]{"", "a", "xabc", "xabc"},
-            new Object[]{"1", "a", "xabc", "xabc"}
+            new Object[]{"", "a", "xa", "xa"},
+            new Object[]{"1", "a", "xa", "xa"}
         )
     );
   }
 
+  @Test(expected = UnsupportedSQLQueryException.class)

Review Comment:
   Absolutely. Every datasource row has a `__time` column.
   
   Here's another way to express this for the `SELECT` case. If the query works for the interactive SQL engine, it should work for the MSQ query engine. It is not a benefit to the user that `LATEST(x)` works on one engine but not the other. It is, instead, a bug to be fixed.
   
   It's certainly true that this might be hard to fix, given how MSQ does things. That just means fixing it is hard.
   
   If, however, this query does not work on the interactive SQL engine, then we've got larger issues which we may not be able to fix in this PR. One issue is, "what does it mean to offer `LATEST(x)` if it doesn't actually work?"
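   
   To make the parity point concrete, here is the kind of query I have in mind (using a hypothetical datasource `wikipedia` that has the implicit `__time` column):
   
   ```sql
   -- Accepted by the interactive SQL engine today: LATEST on a string column,
   -- implicitly keyed on the datasource's __time column.
   SELECT
     "channel",
     LATEST("page") AS "latest_page"
   FROM "wikipedia"
   GROUP BY 1
   ```
   
   The ask is simply that MSQ accept the same query rather than reject it.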



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

