Posted to commits@druid.apache.org by "adarshsanjeev (via GitHub)" <gi...@apache.org> on 2023/02/20 06:34:16 UTC

[GitHub] [druid] adarshsanjeev opened a new issue, #13824: Aggregations on __time column do not work as expected

adarshsanjeev opened a new issue, #13824:
URL: https://github.com/apache/druid/issues/13824

   Aggregations that implicitly depend on the "__time" column (such as LATEST() or EARLIEST()) return time values that default to null when run against an EXTERN source in MSQ.
   The same happens in non-MSQ queries when these aggregations use a lookup as the source, because __time is not present there.
   
   This can be reproduced with the following query:
   
   ```
   WITH "ext" AS (SELECT *
   FROM TABLE(
     EXTERN(
       '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
       '{"type":"json"}',
       '[{"name":"isRobot","type":"string"},{"name":"channel","type":"string"},{"name":"timestamp","type":"string"},{"name":"flags","type":"string"},{"name":"isUnpatrolled","type":"string"},{"name":"page","type":"string"},{"name":"diffUrl","type":"string"},{"name":"added","type":"long"},{"name":"comment","type":"string"},{"name":"commentLength","type":"long"},{"name":"isNew","type":"string"},{"name":"isMinor","type":"string"},{"name":"delta","type":"long"},{"name":"isAnonymous","type":"string"},{"name":"user","type":"string"},{"name":"deltaBucket","type":"long"},{"name":"deleted","type":"long"},{"name":"namespace","type":"string"},{"name":"cityName","type":"string"},{"name":"countryName","type":"string"},{"name":"regionIsoCode","type":"string"},{"name":"metroCode","type":"long"},{"name":"countryIsoCode","type":"string"},{"name":"regionName","type":"string"}]'
     )
   ))
   SELECT
     TIME_PARSE("timestamp") AS "__time",
     LATEST("comment",1024)
   FROM "ext"
   GROUP BY 1
   ```
   with the query context
   `"finalizeAggregations": false`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [I] Aggregations on __time column do not work as expected (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #13824:
URL: https://github.com/apache/druid/issues/13824#issuecomment-1979852055

   This issue has been closed due to lack of activity. If you think that
   is incorrect, or the issue requires additional review, you can revive the issue at
   any time.




[GitHub] [druid] adarshsanjeev commented on issue #13824: Aggregations on __time column do not work as expected

Posted by "adarshsanjeev (via GitHub)" <gi...@apache.org>.
adarshsanjeev commented on issue #13824:
URL: https://github.com/apache/druid/issues/13824#issuecomment-1438347030

   The cause appears to be in StringLastAggregatorFactory, which is used for EARLIEST and LATEST. When the time column is not known, it defaults to ColumnHolder.TIME_COLUMN_NAME ("__time"). This assumes that a time column is present, which holds for non-MSQ queries. For some MSQ queries that read from an external source, the __time column is present in the output, but during aggregation it may be referred to by a temporary name or a virtual column.
   
   An ideal solution would be to handle reading from aliased columns directly. This would help for queries like
   ```
     TIME_PARSE("timestamp") AS "__time",
     LATEST_BY("comment", "__time", 1024),
   ```
   which do not work currently.
   
   An alternative would be to handle EARLIEST and LATEST as a special case for now by changing the implicit reference to the __time column. MSQTaskQueryMaker has the mappings needed to know what is mapped to the __time column in the output: ColumnMappings contains the mapping of __time to the intermediate column (MSQ sets CTX_TIME_COLUMN_NAME to this in its query context), and dimensions contains the mappings of virtual columns. Changing the reference from __time to the column that is mapped to it in the final output produces the expected LATEST result for the query above. This might need additional changes to support compaction, but it could handle that case if the reference is changed to the __time column during the process.
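
A minimal sketch of the special-case idea described above, for illustration only. It does not use Druid's actual classes: `TimeColumnResolver`, both `resolve...` methods, and the intermediate column name `v0` are hypothetical. The only value taken from the discussion is the default name "__time".

```java
import java.util.Map;

/** Hypothetical illustration only; these are not Druid's classes or APIs. */
final class TimeColumnResolver {
    /** Same value the comment above quotes for ColumnHolder.TIME_COLUMN_NAME. */
    static final String TIME_COLUMN_NAME = "__time";

    private TimeColumnResolver() {}

    /** Behavior as described above: fall back to "__time" when no time column is given. */
    static String resolveWithDefault(String explicitTimeColumn) {
        return explicitTimeColumn != null ? explicitTimeColumn : TIME_COLUMN_NAME;
    }

    /**
     * Proposed special case: when no time column is given, use the intermediate
     * column that the output "__time" column is mapped to, if such a mapping exists.
     */
    static String resolveWithMappings(String explicitTimeColumn,
                                      Map<String, String> outputToIntermediate) {
        if (explicitTimeColumn != null) {
            return explicitTimeColumn;
        }
        return outputToIntermediate.getOrDefault(TIME_COLUMN_NAME, TIME_COLUMN_NAME);
    }

    public static void main(String[] args) {
        // Assumed mapping: the output "__time" is produced by an intermediate column "v0".
        Map<String, String> mappings = Map.of(TIME_COLUMN_NAME, "v0");
        System.out.println(resolveWithDefault(null));            // __time
        System.out.println(resolveWithMappings(null, mappings)); // v0
    }
}
```

With the plain default, the aggregator would look for a "__time" column that does not exist under that name during aggregation. With the mapping, it would read the column that actually carries the timestamp.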




[GitHub] [druid] adarshsanjeev commented on issue #13824: Aggregations on __time column do not work as expected

Posted by "adarshsanjeev (via GitHub)" <gi...@apache.org>.
adarshsanjeev commented on issue #13824:
URL: https://github.com/apache/druid/issues/13824#issuecomment-1449425180

   https://github.com/apache/druid/pull/13793 adds validation that rejects EARLIEST and LATEST when there is no __time column in the input schema, with LATEST_BY as a workaround. EARLIEST and LATEST currently assume that a __time column exists, which might not be what the user expects.
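
The kind of check described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the linked pull request: the class `ImplicitTimeValidation`, the method `validate`, and the error message are all hypothetical.

```java
import java.util.Set;

/** Hypothetical illustration only; not the code from the linked pull request. */
final class ImplicitTimeValidation {
    static final String TIME_COLUMN_NAME = "__time";

    /** Functions that read the time column implicitly, per the discussion above. */
    private static final Set<String> IMPLICIT_TIME_FUNCTIONS = Set.of("EARLIEST", "LATEST");

    private ImplicitTimeValidation() {}

    /** Rejects EARLIEST/LATEST when the input schema has no "__time" column. */
    static void validate(String functionName, Set<String> inputColumns) {
        if (IMPLICIT_TIME_FUNCTIONS.contains(functionName)
                && !inputColumns.contains(TIME_COLUMN_NAME)) {
            throw new IllegalArgumentException(
                functionName + " requires a " + TIME_COLUMN_NAME
                + " column in the input; use " + functionName
                + "_BY with an explicit timestamp expression instead");
        }
    }

    public static void main(String[] args) {
        Set<String> externColumns = Set.of("timestamp", "comment");
        validate("LATEST_BY", externColumns); // accepted: the timestamp is explicit
        try {
            validate("LATEST", externColumns); // rejected: no __time in the input
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```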
   
   




Re: [I] Aggregations on __time column do not work as expected (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #13824:
URL: https://github.com/apache/druid/issues/13824#issuecomment-1931002221

   This issue has been marked as stale due to 280 days of inactivity.
   It will be closed in 4 weeks if no further activity occurs. If this issue is still
   relevant, please simply write any comment. Even if closed, you can still revive the
   issue at any time or discuss it on the dev@druid.apache.org list.
   Thank you for your contributions.




Re: [I] Aggregations on __time column do not work as expected (druid)

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #13824: Aggregations on __time column do not work as expected 
URL: https://github.com/apache/druid/issues/13824

