You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2018/11/10 07:10:00 UTC

[GitHub] john-bodley opened a new issue #6360: [SIP-15] Transparent and Consistent Time Interval Principles

john-bodley opened a new issue #6360: [SIP-15] Transparent and Consistent Time Interval Principles
URL: https://github.com/apache/incubator-superset/issues/6360
 
 
   ### Motivation
   
   One of the core tenants of any BI tool is accuracy, however we have seen instances where the time interval provides misleading or potentially incorrect results due to the nuances in how Superset currently i) uses different logic depending on the connector, and ii) mixes of datetime resolutions.   
   
   #### The time-interval conundrum
   
   A time interval is defined by a `start` and `end` time and whether the limits are inclusive (`[`, `]`) or exclusive (`(`, `)`). This leads to four possible definitions:
   
   1. `[start, end]`
   2. `[start, end)`
   3. `(start, end)`
   4. `(start, end]`
   
   The underlying issue is it is not apparent to the user in the UI when the specify a time range which definition Superset is invoking. Sadly the answer depends on the connector (SQLAlchemy or Druid REST API) and in the case of SQLAlchemy engines the underlying structure of the datasource. This leads to several issues:
   
   - Inconsistent behavior between connectors.
   - Incorrect aggregations for certain chart types.
   
   The later is especially concerning for chart types which obfuscate the temporal component or for time grains which use aggregation and thus potentially providing incorrect results, i.e., a user may think they are aggregating a week's worth of data (seven days) whereas the reality is they are only aggregating six days due to the inclusion/exclusion logic.
   
   ##### Druid REST API
   
   The Druid REST API uses the `[start, end)` definition (2) for time intervals although it is not explicitly mentioned in their [documentation](http://druid.io/docs/latest/querying/timeseriesquery.html) though is defined [here](https://github.com/apache/incubator-druid/issues/5055#issuecomment-342491033).
   
   ##### SQLAlchemy
   
   The SQLAlchemy engines use the `[start, end]` definition (1), i.e., the time filter limits are defined via `>=` and `<=` conditions respectively. Note however unbeknown to the user the filter may behave like `(start, end]` (3). Why? The reason is due to the potential mixing of dates and timestamps and using lexicographical order for evaluating clauses, i.e., assume that the time column `ds` is defined as a date string, then a filter of the form,
   ```
   WHERE
       ds >= '2018-01-01 00:00:00' AND
       ds <= '2018-01-02 00:00:00'
   ```
   for a set of `ds` (datestamp) values of [`2018-01-01`, `2018-01-02`] results in,
   ```
   > SELECT
       '2018-01-01' >= '2018-01-01 00:00:00',
       '2018-01-01' <= '2018-01-02 00:00:00'
   FALSE TRUE
   ```
   and
   ```
   > SELECT
       '2018-01-02' >= '2018-01-01 00:00:00',
       '2018-01-02' <= '2018-01-02 00:00:00'
   TRUE TRUE
   ```
   respectively. Due to the lexicographical order the `[start` actually acts like `(start` which is probably not what the user expected.
   
   ### Proposed Change
   
   I propose there are two things we need to solve:
   
   1. **Consistency**: Ensure that all connections and datasources use the same interval definitions.
   2. **Transparency**: Explicitly call out in the UI what the interval definition is.
   
   #### Consistency
   
   Firstly which of the four definitions make the most sense? I propose Druid's definition of `[start, end)` (2) makes the most sense as it guarantees to capture the entire time interval regardless of the time resolution in the data, i.e., for SQL a 24 hour interval would be expressed as:
   ```
   WHERE
       time >= TIMESTAMP '2018-01-01 00:00:00' AND
       time < TIMESTAMP '2018-02-01 00:00:00'
   ```
   The reason not to opt for `[start, end]` (1) is this 24 hour interval could potentially be expressed as:
   ```
   WHERE
       time >= TIMESTAMP '2018-01-01 00:00:00' AND
       time <= TIMESTAMP '2018-02-01 23:59:59'
   ```
   however it assumes that the finest granularity that `time` column is defined is in seconds. In the case of milliseconds it wouldn't capture most of the last second in the 24 hour period. Note the current UI only has second precision.
   
   Secondly how to we address lexicographical issue caused by mixing of date and timestamp strings? Given that we explicitly define the time as time (rather than date) we should enforce that all time columns are cast/converted to a timestamp, i.e., in the case of Presto either:
   ```
   WHERE
       DATE_PARSE(ds, '%Y-%m-%d') >= TIMESTAMP '2018-10-01 00:00:00' AND
       DATE_PARSE(ds, '%Y-%m-%d') < TIMESTAMP '2018-10-31 00:00:00'
   ```
   (preferred) or
   ```
   WHERE
       CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) => '2018-10-01 00:00:00' AND
       CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) < '2018-10-31 00:00:00'
   ```
   
   #### Transparency
   
   I sense we need to improve the UI to explicitly call out:
   
   1. The interval definition.
   2. The relative time.
   
   ##### The interval definition
   
   I sense a tooltip here would be suffice simply mentioning that the start is inclusive and the end is exclusive of.
   
   #### The relative time
   
   Both defaults and custom time periods are relative to some time. The custom time mentions "Relative to today" however it isn't clear if today means a time (now, 12:00:00 am, etc.) or a date. Furthermore defaults have no mention what the reference time is. 
   
   <img width="282" alt="screen shot 2018-11-09 at 9 52 10 pm" src="https://user-images.githubusercontent.com/4567245/48298393-95c5b680-e471-11e8-9070-6d79d607687e.png"><img width="280" alt="screen shot 2018-11-09 at 9 52 22 pm" src="https://user-images.githubusercontent.com/4567245/48298400-965e4d00-e471-11e8-91e7-0cdf04bca61f.png">
   
   In both instances it's today at 12:00:00 am and there seems to be merit in explicitly calling this out. Additionally there may be merit in having an asterisk (or similar) when a relative time period is chosen, i.e., `Last 7 days *` to help ground the reference to today at 12:00:00 am. 
   
   <img width="132" alt="screen shot 2018-11-09 at 10 03 13 pm" src="https://user-images.githubusercontent.com/4567245/48298396-95c5b680-e471-11e8-8349-02dd77119796.png">
   
   <img width="279" alt="screen shot 2018-11-09 at 9 56 44 pm" src="https://user-images.githubusercontent.com/4567245/48298399-965e4d00-e471-11e8-93e9-f276ca44edba.png"><img width="279" alt="screen shot 2018-11-09 at 9 57 24 pm" src="https://user-images.githubusercontent.com/4567245/48298398-965e4d00-e471-11e8-916e-02560856826b.png">
   
   #### Concerns
   
   I sense there are potentially two major concerns:
   
   1. Migrations.
   2. Performance. 
   
   ##### Migrations 
   
   If you asked a producer of a chart which of the four time interval definitions was being adhered to you would get the full gamut of responses, i.e., it's not evident to them exactly what the current logic is and thus it's not evident to us how we would migrate time intervals which used an absolute time for either the start or end. I sense the only solution here is to realize that this is a breaking change which though challenging provides mores transparency and consistency in the future. An institution would probably want to inform their customers of such a change via a PSA or similar.  
   
   ##### Performance
   
   At Airbnb our primary SQLAlchemy engine is Presto where the underlying table is partitioned by datestamp (denoted as `ds`). One initial concern I had was by enforcing time time column to represent a timestamp using a combination of Presto's [date and time functions](https://prestodb.io/docs/current/functions/datetime.html) that the partition predicate logic would not be enforce resulting in a full table-scan which would not be performant especially for tables spanning multiple years. The concern didn't materialize as the following queries provided similar performance:
   ```
   SELECT
       COUNT(1)
   FROM
       <table>
   WHERE
       ds >= '2018-01-01 00:00:00' AND
       ds <= '2018-01-31 00:00:00'
   ```
   ```
   SELECT
       COUNT(1)
   FROM
       <table>
   WHERE
       DATE_PARSE(ds, '%Y-%m-%d') > TIMESTAMP '2018-10-01 00:00:00' AND
       DATE_PARSE(ds, '%Y-%m-%d') <= TIMESTAMP '2018-10-31 00:00:00'
   ```
   ```
   SELECT
       COUNT(1)
   FROM
       <table>
   WHERE
       CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) => '2018-10-01 00:00:00' AND
       CAST(DATE_PARSE(ds, '%Y-%m-%d') AS VARCHAR) < '2018-10-31 00:00:00'
   ```
   ### New or Changed Public Interfaces
   
   Added clarity to the time range widget.
   
   ### New dependencies
   
   None.
   
   ### Migration Plan and Compatibility
   
   There are no planed migrations however this would be a breaking change.
   
   ### Rejected Alternatives
   
   None.
   
   to: @betodealmeida @fabianmenges @graceguo-supercat @jeffreythewang @kristw @michellethomas @mistercrunch @timifasubaa  @williaster 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org