Posted to dev@druid.apache.org by Will Lauer <wl...@verizonmedia.com.INVALID> on 2020/05/08 19:16:27 UTC

Advice needed for new forecasting and anomaly queries

I'm looking for some advice about the best way to design some new query
functionality that I need.

Some background:
We need some forecasting and anomaly detection capabilities in our reports,
and for a number of reasons, it looks like these would best fit in the
druid layer rather than in the web service that we use to wrap druid. We'd
like to be able to do a number of things (possibly all as different query
types):

   1. Generate a time series (possibly including multiple metrics),
   including a "forecast" into the future of the expected values given some
   amount of history. So, possibly displaying only the current quarter, with
   actual values for the quarter to date along with forecast values (and error
   bounds) for the rest of the quarter based on the last 2 years of data.
   2. Generate a time series of actual data overlaid with its expected
   seasonally adjusted value based on N prior periods.
   3. Generate a time series with "anomalous" periods identified (based on
   some amount of history).
   4. Decompose time series into trend, seasonal, residual values.
   5. Possibly support all of the prior functionality on group-by queries
   (with sufficiently low cardinality), moving-average queries, lookback
   queries, etc.

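For concreteness, a result row for capability 1 might pair actual values
with forecast values and error bounds per time bucket. A hypothetical
sketch, loosely following the shape of Druid timeseries results (all of the
forecast field names here are invented, not an existing API):

```json
{
  "timestamp": "2020-06-15T00:00:00.000Z",
  "result": {
    "views": null,
    "views_forecast": 1250000,
    "views_lower95": 1100000,
    "views_upper95": 1400000
  }
}
```

Here "views" is null because the bucket lies in the future; for buckets in
the quarter to date it would hold the actual value, and the forecast fields
could be null or omitted.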
The question:
Typically, I would implement these functionalities as a new query type that
simply nests the underlying timeseries (or other) query, processes the
resulting data through the appropriate trending, forecasting, and anomaly
detection algorithms, and emits the desired result. But with Druid's
ongoing move to SQL as its primary query interface, I'm trying to rethink
how this should be implemented and whether I should do anything differently
to fit these new functionalities into the Druid architecture.

If I leave this as a new query type using the traditional JSON-based query
API, I need to decide how it should be structured, answering questions like

   - is the "nested" query that should be run represented as a "query"
   datasource, or does the new query just mimic the API of timeseries (or
   maybe groupby)?
   - how should the historical baseline be represented, given that the
   query naturally uses two intervals - the interval to report over vs. the
   interval to use for computing the historical baseline?
   - how should the desired analysis be represented (attributes of the
   query, "magical" postaggs that trigger high-level analysis, etc.)?

If the new functionality should be exposed via the SQL API, what is the
best way to do that? How are new query types integrated with SQL? What is
the best way to represent this type of data in SQL in the first place?
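
One option other systems have used for time-series extensions is a table
function that wraps an inner query. A purely hypothetical sketch - FORECAST
is not an existing Druid SQL function, and the argument shape is invented:

```sql
SELECT __time, views, views_forecast, views_lower95, views_upper95
FROM TABLE(
  FORECAST(
    'SELECT TIME_FLOOR(__time, ''P1D'') AS __time, SUM(views) AS views
     FROM pageviews
     WHERE __time >= TIMESTAMP ''2018-07-01''
     GROUP BY 1',
    '2020-04-01/2020-07-01',   -- report interval
    'holtWinters'              -- analysis/model
  )
)
```

A table function keeps the inner query ordinary SQL, but other mappings are
conceivable too (e.g. window-function-style syntax over an ordinary SELECT).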

Any thoughts that people have would be appreciated.

Will


Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822
