You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/05/01 18:37:47 UTC

[GitHub] [incubator-superset] betodealmeida opened a new issue #7425: [SIP] Scheduling queries from SQL Lab

betodealmeida opened a new issue #7425: [SIP] Scheduling queries from SQL Lab
URL: https://github.com/apache/incubator-superset/issues/7425
 
 
   ## [SIP] Proposal for scheduling queries from SQL Lab
   
   ### Motivation
   
   A common approach for building dashboards involves:
   
   1. User writes a complex query in SQL Lab, often with joins, to get the data they need.
   2. User clicks "visualize", to explore the data.
   3. User builds a visualization. This step is usually slow, since the SQL query has to be recomputed every time.
   4. User builds a dashboard using the visualization. It's often slow, since the SQL query has to be recomputed every time.
   5. In order to get fresh data in the dashboard, user has to either update the underlying SQL query or write expensive queries using macros (scanning 7 days of data, for example).
   
   We want to optimize that process, allowing the user to write **one** query in SQL Lab that runs periodically. The query should scan data only for its interval (1 day of data for a daily schedule, for example). This way dashboards can be kept up-to-date with cheap queries.
   
   In this SIP, I propose a way of scheduling queries from within SQL Lab. The actual scheduling of the query is left to an external service (like [Apache Airflow](https://airflow.apache.org), for example). Superset will simply enrich saved queries with additional metadata for the external scheduler.
   
   The proposal is scheduler-agnostic, and can be used with Apache Airflow, Luigi or any other scheduler, since the form for collecting the metadata needed is defined in `config.py` using [react-jsonschema-form](https://github.com/mozilla-services/react-jsonschema-form).
   
   ### Proposed Change
   
   In this SIP we propose adding a new feature flag called `SCHEDULED_QUERIES`. Instead of a boolean, the feature flag would be a dictionary with two keys, `JSONSCHEMA` and `UISCHEMA` (see discussion [here](https://github.com/apache/incubator-superset/pull/6943#issuecomment-469990969) for using a dict as a feature flag). As an example:
   
   ```python
   FEATURE_FLAGS = {
       # Configuration for scheduling queries from SQL Lab. This information is
       # collected when the user clicks "Schedule query", and saved into the `extra`
       # field of saved queries.
       # See: https://github.com/mozilla-services/react-jsonschema-form
       'SCHEDULED_QUERIES': {
           'JSONSCHEMA': {
               'title': 'Schedule',
               'description': (
                   'In order to schedule a query, you need to specify when it '
                   'should start running, when it should stop running, and how '
                   'often it should run. You can also optionally specify '
                   'dependencies that should be met before the query is '
                   'executed. Please read the documentation for best practices '
                   'and more information on how to specify dependencies.'
               ),
               'type': 'object',
               'properties': {
                   'output_table': {
                       'type': 'string',
                       'title': 'Output table name',
                   },
                   'start_date': {
                       'type': 'string',
                       'format': 'date-time',
                       'title': 'Start date',
                   },
                   'end_date': {
                       'type': 'string',
                       'format': 'date-time',
                       'title': 'End date',
                   },
                   'schedule_interval': {
                       'type': 'string',
                       'title': 'Schedule interval',
                   },
                   'dependencies': {
                       'type': 'array',
                       'title': 'Dependencies',
                       'items': {
                           'type': 'string',
                       },
                   },
               },
           },
           'UISCHEMA': {
               'schedule_interval': {
                   'ui:placeholder': '@daily, @weekly, etc.',
               },
               'dependencies': {
                   'ui:help': (
                       'Check the documentation for the correct format when '
                       'defining dependencies.'
                   ),
               },
           },
       },
   }
   ```
   
   The configuration is used to dynamically generate a form for collecting the extra metadata needed in order to schedule the query. The example above should work for many schedulers, but it can also be easily adapted (or completely changed) depending on the needs.
   
   If this flag is present, SQL Lab will show a button label "Schedule Query":
   
   <img width="594" alt="Screen Shot 2019-05-01 at 11 29 46 AM" src="https://user-images.githubusercontent.com/1534870/57034484-79c2f680-6c04-11e9-900e-7dc05e40bf6b.png">
   
   Clicking it pops up a modal:
   
   <img width="632" alt="Screen Shot 2019-05-01 at 11 31 05 AM" src="https://user-images.githubusercontent.com/1534870/57034550-a840d180-6c04-11e9-957f-258d3270a7f6.png">
   
   When the user clicks "Submit" the query is saved (just like a saved query) with the schedule information stored in its JSON metadata. The user can edit the query, like any saved query, and the scheduler can fetch the scheduled queries using the API provided by FAB.
   
   ### New or Changed Public Interfaces
   
   None.
   
   ### New dependencies
   
   [react-jsonschema-form](https://github.com/mozilla-services/react-jsonschema-form) is an Apache 2 licensed project created by Mozilla. It was last updated 14 days ago, and has ~35k weekly downloads.
   
   ### Migration Plan and Compatibility
   
   None.
   
   ### Rejected Alternatives
   
   We consider using celery workers to run the queries, but this would add a lot of complexity for backfills, alerting, etc. The proposed approach leverages existing schedulers, leaving to Superset only the task of annotating queries with extra metadata.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org