Posted to dev@superset.apache.org by gi...@git.apache.org on 2017/10/02 17:30:00 UTC

[GitHub] rjpereira opened a new issue #3564: Request for Comments : Tableau-like data extracts feature

URL: https://github.com/apache/incubator-superset/issues/3564
 
 
   My team is about to start implementing functionality in Superset which I believe could have general use. While we may have to cut some corners to get it working sooner rather than later, we would welcome your suggestions.
   
   This feature implements Tableau-like TDE extracts (or, to some extent, QuickSight's SPICE). It doesn't assume any "optimized-storage magic": it simply intends to materialise views of other databases in Superset-owned storage, most probably living on the same server as Superset's own metadata. For now I'm tentatively calling it QDE: Query Data Extract.
   
   My idea is to extend the schema of Saved Queries with attributes to flag a query as "to be extracted", an "extract name", and a crontab expression for the refresh frequency. This would extend the saved_query table and the UI that edits queries.
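   As a rough illustration of the new attributes, a minimal sketch follows. The field names (is_extract, extract_name, refresh_crontab, last_extracted_at) are hypothetical; a real change would be an Alembic migration on Superset's SQLAlchemy SavedQuery model rather than this standalone dataclass.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SavedQueryExtract:
    # Illustrative QDE attributes for the saved_query table; all names
    # here are assumptions, not Superset's actual schema.
    is_extract: bool = False                      # flag: "to be extracted"
    extract_name: Optional[str] = None            # target table name in QDE storage
    refresh_crontab: Optional[str] = None         # e.g. "0 */6 * * *"
    last_extracted_at: Optional[datetime] = None  # last successful refresh
```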
   
   A working QDE setup would require launching celery beat with a configurable polling frequency (say 5 minutes). A worker task, of the kind Superset workers already support, would scan the list of saved queries that are due a refresh. To do this, it would expand the crontab expression stored in the DB together with the last extraction time, also stored on saved_query; if there is a new scheduled run time between the last run and now, the query would be executed (with code similar to SQL Lab's), but the results would be uploaded to the QDE database rather than to the Werkzeug cache as with async queries. This code would, of course, create or replace schemas if the schema of the query didn't match the one in QDE. In time this should gain "smart sync" logic, but that is harder to generalise. The extracts would be saved in QDE under the "extract name" added to Saved Query.
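   The due-for-refresh decision the worker makes can be sketched as below. The schedule-expansion function is pluggable so the sketch stays dependency-free; in a real setup it would expand the stored crontab expression (e.g. with the croniter library). All function names here are illustrative.

```python
from datetime import datetime, timedelta


def is_due(last_extracted_at, now, next_run_after):
    """Return True if a scheduled run time falls between the last
    extraction and now.

    next_run_after(ts) must return the first scheduled run strictly
    after ts -- e.g. croniter(expr, ts).get_next(datetime) in practice.
    """
    return next_run_after(last_extracted_at) <= now


def next_top_of_hour(ts):
    # Stand-in schedule equivalent to the crontab "0 * * * *":
    # the top of the next hour after ts.
    return ts.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
```

   With an hourly schedule, a query last extracted at 09:30 is due at 10:05 (the 10:00 run time has passed) but not at 09:45.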
   
   Extracts would then be used like any normal database, i.e. QDE would be a source of "materialisations" of all the "slow sources" we would normally query in SQL Lab.
   
   In addition to this, we will be refreshing, in workers, slice caches that are due to expire. Two separate ideas here:
   a) We could check all the slices that are based on QDE extracts and, after a refresh of an extract, run the slice's query in a worker, forcing execution to extend the cache time.
   b) Even for slices not backed by QDE, we could check those with less than X hours to expiry and re-run them in the background.
   
   I understand those last two options do not apply to everyone. Lazy caching is different from active caching, which is what we need, but maybe not what everyone else needs. Some more UI would be required to offer an option to "actively refresh slices".
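   The selection step for idea (b) above can be sketched as follows. The (slice_id, cache_expires_at) input shape and the function name are assumptions for the sketch, not Superset's actual API.

```python
from datetime import datetime, timedelta


def slices_to_refresh(slice_expiries, now, horizon=timedelta(hours=2)):
    """Pick slices whose cached result expires within `horizon`, so a
    background worker can re-run them before the cache goes stale.

    slice_expiries: iterable of (slice_id, cache_expires_at) pairs
    (an assumed shape for this sketch).
    """
    return [
        slice_id
        for slice_id, expires_at in slice_expiries
        if expires_at - now <= horizon
    ]
```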
   
   As mentioned, we've already started this for our own use, but any ideas are welcome.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services