You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/08/05 18:26:11 UTC
[GitHub] gianm commented on issue #6105: Allow sorting segments on some dims before time

gianm commented on issue #6105: Allow sorting segments on some dims before time
URL: https://github.com/apache/incubator-druid/issues/6105#issuecomment-410538507
 
 
   I changed the title a bit, since we do allow specifying the sort order for every column except __time (it's the order from dimensionsSpec) but we do not let you put time in the middle of them. So this issue may lead to adding a sortOrder parameter, but I wanted the title to capture the problem we're trying to solve.
   
   For anyone reading, check out #6066 for motivation and what workarounds people use today.
   
   As to removing the assumption that `__time` is sorted, these are the places I can think of where that exists:
   
   1. The `makeCursors` function in QueryableIndexStorageAdapter is responsible for creating a sequence of time-granular cursors. This will probably be the toughest thing to adjust, since many query engines depend on the time-ordered and time-granularized nature of these cursors (including the timeseries, topN, groupBy, and timeBoundary engines). Probably for timeseries the best thing to do is call through to the groupBy engine, which can handle its grouping keys being out of order, and is capable of grouping on time. For topN and groupBy I'm not so sure. For groupBy in particular keep in mind #1926; the behavior of "granularity" is not equivalent to the behavior of adding a time-floored dimension.
   2. makeCursors again: its time filtering functionality assumes that the time column is sorted.
   3. SingleScanTimeDimSelector is used for extractionFns on the `__time` column, and it also assumes the column is sorted.
   4. SingleLongInputCachingExpressionColumnValueSelector is used for expressions on the `__time` column. It doesn't assume sortedness, but the optimization it does may not make as much sense if the column is sorted.
   5. The `getMinTime()` and `getMaxTime()` functions in QueryableIndexStorageAdapter assume that they should return the first and last row, respectively.
   
   Some other considerations:
   
   - Given (1) we will probably want to translate some `granularity` based queries into queries that use the `timestamp_floor` expression. For example, a timeseries with granularity less than the segment interval should get executed the same way as a `timestamp_floor` groupBy, and we could make that translation under the hood. Therefore we should look into optimizing `timestamp_floor`. We should expect at least _some_ localized sortedness in the time column and the function should take advantage of that. Currently, it doesn't.
   - If it makes life easier for a first version of this patch, we could consider _not_ supporting "granularity" based queries for topN and groupBy on non-time-sorted segments at first. I believe they are less commonly used than granular timeseries, and could be added in later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org