Posted to commits@druid.apache.org by "paul-rogers (via GitHub)" <gi...@apache.org> on 2023/02/12 21:05:16 UTC

[GitHub] [druid] paul-rogers commented on pull request #13793: Add validation for aggregations on __time

paul-rogers commented on PR #13793:
URL: https://github.com/apache/druid/pull/13793#issuecomment-1427132705

   This is an interesting issue as it touches on a fundamental ambiguity in MSQ. MSQ uses a `SELECT` to perform aggregation. There are two ways to interpret such aggregations:
   
   * Do the aggregations within the `SELECT` and insert the resulting rows as detail rows into a detail datasource.
   * Do the aggregations not just within the `SELECT` but also later during compaction. Insert the aggregated rows into a rollup datasource.
   
   The first interpretation is standard SQL, in which each `SELECT` is a black box that simply returns rows. The second interpretation is required for compaction to work: if we write the finalized `EARLIEST` value into the datasource, compaction cannot then take two segments covering the same time period and combine them to find the "earliest EARLIEST".
   
   As a result, we want MSQ semantics to differ from those of SQL: the contents of the `SELECT` statement "leak out" and are written to segments so that compaction can pick up where MSQ left off and continue aggregating. This is non-standard SQL, and it forces us to rethink many parts of SQL handling (and to work around the fact that Calcite assumes standard SQL semantics).
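   
   To make the "pick up where MSQ left off" idea concrete, here is a conceptual sketch of what combining two segments requires (the datasource `dst` and column `firstValue` are illustrative; real compaction merges aggregator state inside segments rather than running SQL):
   
   ```sql
   -- Conceptual only: merging two segments that cover the same time chunk
   -- amounts to re-applying the aggregation over their combined rows,
   -- i.e. finding the "earliest EARLIEST". That is possible only if the
   -- segments store combinable aggregator state, not finalized values.
   SELECT __time, EARLIEST(firstValue) AS firstValue
   FROM dst
   GROUP BY __time
   ```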
   
   Now to this specific case. Suppose I write my query as `LATEST_BY(foo, timestamp)` where `timestamp` is a column in the external table. This presents a variety of issues.
   
   * The `timestamp` column is probably not a `TIMESTAMP`. So, I must write `LATEST_BY(foo, TIME_PARSE(timestamp))`. This solves the first case above: the SQL is a black box.
   * SQL semantics say that I can use SELECT columns in an aggregate. So, it should be perfectly legal to say `LATEST_BY(foo, __time)` where `__time` has been defined as `TIME_PARSE(timestamp) AS __time`.
   * If we go with `LATEST_BY(foo, TIME_PARSE(timestamp))`, then we cannot support the rollup use case: the `timestamp` column does not exist in the generated segments, and so compaction cannot further aggregate the rows.
   * A workaround might be to write `timestamp` into the segment so it is available for compaction (see the sketch below). However, this will be confusing for the user: why must they save the same column twice, just so that a column name from the _input_ is available in the _output_?
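   
   To spell out that workaround, a hypothetical sketch (the `savedTimestamp` column is invented for illustration, and `foo` stands in for whatever column we aggregate):
   
   ```sql
   INSERT INTO dst
   SELECT
     TIME_PARSE("timestamp") AS __time,
     -- Hypothetical: persist a duplicate copy of the raw input timestamp
     -- so compaction still has it available for re-aggregation.
     "timestamp" AS savedTimestamp,
     LATEST_BY(foo, TIME_PARSE("timestamp")) AS latestFoo
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, foo VARCHAR)
   GROUP BY 1, 2
   ```
   
   The cost is storing the same information twice, which is exactly the confusion described above.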
   
   Or perhaps the goal is that `LATEST_BY` results cannot be further aggregated (because the input column is no longer available). If that is the goal, one could argue that the result is a bug: if we force users to use `LATEST_BY` because `LATEST` is broken, but `LATEST_BY` is also broken, we've not actually made much progress.
   
   So, what should we do? We should ensure that aggregators refer to columns from the `SELECT` clause, not the input. That is:
   
   ```sql
   INSERT INTO dst
   SELECT TIME_PARSE("timestamp") AS __time, SUM(CAST("value" AS BIGINT)) AS mySum
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, "value" VARCHAR)
   GROUP BY 1
   ```
   
   This sums the `value` column after parsing it. The change here suggests I should have written:
   
   ```sql
   INSERT INTO dst
   SELECT TIME_PARSE("timestamp") AS __time, SUM("value") AS mySum
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, "value" VARCHAR)
   GROUP BY 1
   ```
   
   That is, I have to `SUM()` the input column, but that input column is a `VARCHAR`, which is not valid to sum.
   
   Let's apply this to the case at hand. What we want is to refer to the output columns:
   
   ```sql
   INSERT INTO dst
   SELECT
     TIME_PARSE("timestamp") AS __time,
     LATEST_BY(CAST("value" AS BIGINT), __time) AS latestValue
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, "value" VARCHAR)
   GROUP BY 1
   ```
   
   Or, better: because `__time` is, in fact, an output column, why not just use `LATEST`:
   
   ```sql
   INSERT INTO dst
   SELECT
     TIME_PARSE("timestamp") AS __time,
     LATEST(CAST("value" AS BIGINT)) AS latestValue
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, "value" VARCHAR)
   GROUP BY 1
   ```
   
   If we go with the rule here, then I would have to write:
   
   ```sql
   INSERT INTO dst
   SELECT
     TIME_PARSE("timestamp") AS __time,
     LATEST_BY("value", "timestamp") AS latestValue
   FROM TABLE(EXTERN(...))
     ("timestamp" VARCHAR, "value" VARCHAR)
   GROUP BY 1
   ```
   
   The proposed solution goes against SQL semantics, yet cannot even be made correct: the types of both `value` and `timestamp` are wrong. To fix that, I suppose I could use a nested select:
   
   ```sql
   INSERT INTO dst
   SELECT
     __time,
     LATEST_BY(numValue, __time) AS latestValue
   FROM (
     SELECT
       TIME_PARSE("timestamp") AS __time,
       CAST("value" AS DOUBLE) AS numValue
     FROM TABLE(EXTERN(...))
       ("timestamp" VARCHAR, "value" VARCHAR)
   )
   GROUP BY 1
   ```
   
   Still, to follow SQL, the `__time` in `LATEST_BY` would be the one from the outer `SELECT`, but MSQ needs it to be the one from the inner `SELECT`. We would find ourselves rewriting Calcite (and rethinking hundreds of pages of SQL semantics) to make that work.
   
   In short, this bug could use a bit more design thinking before we propose a code change.