You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/05/21 21:14:47 UTC
[GitHub] [incubator-superset] rjurney opened a new issue #7571: [SIP]
Proposal for Improving Examples Interface, Organization and Storage
rjurney opened a new issue #7571: [SIP] Proposal for Improving Examples Interface, Organization and Storage
URL: https://github.com/apache/incubator-superset/issues/7571
# [SIP] Proposal for Improving Examples Interface, Organization and Storage
The goal of the changes in this proposal is to improve the examples capabilities of Superset so as to foster an ecosystem of examples which will sustain and grow as the platform continues to develop. First I will characterize the existing system of examples and then propose changes to improve the number and quality of examples.
## Current Examples
Examples are currently programmatically defined in the `superset.data` module. An abstract interface summarizing these examples looks like the following:
```python
class AbstractSupersetExample(ABC):
"""Defines interface through which superset examples load themselves."""
def __init__(self, description):
self.description = description
def load_data(self):
# Task 1: Load file and create pandas.DataFrame
# Task 2: Load data into SQL with pandas.DataFrame.to_sql()
# Task 3: Process through ORM to get back workable Table object from whichever data source the table is in
pass
def create_metrics(self):
# Task 1: Build any TableColumns
# Task 2: Build Metrics - SQLMetrics
# Task 3: Store metrics in DB via ORM
pass
def create_charts(self, slices):
# Task 1: Build Slice from config/JSON
# Task 2: Store to DB via - misc_dash_slices.add(slc.slice_name) / merge_slice(slc)
pass
def create_dashboards(self, name, config):
# Task 1: Instantiate Dash via ORM
# Task 2: Configure Dash via JSON
# Task 3: Store to DB via ORM
pass
```
While this mechanism jump started the collection of Superset examples, defining examples as code will not appeal to most Superset users of growing a community that contributes examples.
### Current Dashboard Export
Dashboard example creation could utilize the export feature by adding example oriented fields to the export. Dashboards can be exported via the Dashboard List interface at `/dashboard/list/` via its Export action or at the command line via `superset export_dashboards`. Dashboard export JSON includes everything needed to reproduce a dashboard save the actual data table: dashboard, slice and datasource information.
```json
{
"dashboards": [
{
"__Dashboard__": {
"created_on": {
"__datetime__": "2019-04-19T10:29:21"
},
"changed_by_fk": null,
"slug": "world_health",
"json_metadata": "{\"remote_id\": 1}",
"description": null,
"dashboard_title": "World's Bank Data",
"changed_on": {
"__datetime__": "2019-05-02T17:55:47"
},
"created_by_fk": null,
"css": null,
"position_json": { ... },
"id": 1,
"slices": [
{
"__Slice__": {
"created_by_fk": null,
"cache_timeout": null,
"params": "{\"compare_lag\": \"10\", \"compare_suffix\": \"o10Y\", \"country_fieldtype\": \"cca3\", \"date_filter\": false, \"entity\": \"country_code\", \"filter_configs\": [{\"asc\": false, \"clearable\": true, \"column\": \"region\", \"key\": \"2s98dfu\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}, {\"asc\": false, \"clearable\": true, \"column\": \"country_name\", \"key\": \"li3j2lk\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}], \"granularity_sqla\": \"year\", \"groupby\": [], \"limit\": \"25\", \"markup_type\": \"markdown\", \"metric\": \"sum__SP_POP_TOTL\", \"metrics\": [\"sum__SP_POP_TOTL\"], \"row_limit\": 50000, \"secondary_metric\": \"sum__SP_POP_TOTL\", \"show_bubbles\": true, \"since\": \"2014-01-01\", \"time_range\": \"2014-01-01 : 2014-01-02\", \"until\": \"2014-01-02\", \"viz_type\": \"filter_box\", \"where\": \"\", \"remote_id\": 371, \"datasource_name\": \"wb_health_population\", \"schema\": \"wb_health_population\", \"database_name\": \"main\"}",
"datasource_name": null,
"datasource_type": "table",
"slice_name": "Region Filter",
"changed_on": {
"__datetime__": "2019-05-02T18:00:39"
},
"changed_by_fk": 1,
"perm": "[main].[wb_health_population](id:2)",
"description": null,
"viz_type": "filter_box",
"datasource_id": 2,
"id": 371,
"created_on": {
"__datetime__": "2019-05-02T17:55:47"
},
"owners": []
}
}
],
"datasources": [
{
"__SqlaTable__": {
"created_on": {
"__datetime__": "2019-04-19T10:29:20"
},
"sql": null,
"cache_timeout": null,
"changed_on": {
"__datetime__": "2019-05-02T17:55:47"
},
"is_sqllab_view": false,
"params": "{\"remote_id\": 2, \"database_name\": \"main\"}",
"id": 2,
"template_params": null,
"perm": "[main].[wb_health_population](id:2)",
"description": "<!--\nLicensed to the Apache Software Foundation...",
"created_by_fk": null,
"table_name": "wb_health_population",
"default_endpoint": null,
"changed_by_fk": null,
"main_dttm_col": "year",
"is_featured": false,
"database_id": 1,
"filter_select_enabled": true,
"fetch_values_predicate": null,
"offset": 0,
"schema": null,
"columns": [
{
"__TableColumn__": {
"created_by_fk": null,
"python_date_format": null,
"is_dttm": false,
"description": null,
"groupby": false,
"type": "FLOAT",
"verbose_name": null,
"id": 4,
"changed_by_fk": null,
"created_on": {
"__datetime__": "2019-04-19T10:29:20"
},
"table_id": 2,
"filterable": false,
"expression": "",
"database_expression": null,
"is_active": true,
"column_name": "NY_GNP_PCAP_CD",
"changed_on": {
"__datetime__": "2019-04-19T10:29:20"
}
}
},
...
],
"metrics": [
{
"__SqlMetric__": {
"created_by_fk": null,
"table_id": 2,
"d3format": null,
"description": null,
"metric_name": "sum__SP_POP_TOTL",
"changed_on": {
"__datetime__": "2019-04-19T10:29:20"
},
"metric_type": null,
"changed_by_fk": null,
"expression": "sum(SP_POP_TOTL)",
"warning_text": null,
"is_restricted": false,
"verbose_name": null,
"id": 3,
"created_on": {
"__datetime__": "2019-04-19T10:29:20"
}
}
},
...
],
"database": {
"__Database__": {
"allow_csv_upload": true,
"verbose_name": null,
"created_by_fk": null,
"allow_ctas": false,
"database_name": "main",
"changed_by_fk": null,
"allow_dml": false,
"sqlalchemy_uri": "sqlite:////Users/rjurney/.superset/superset.db",
"force_ctas_schema": null,
"password": null,
"allow_multi_schema_metadata_fetch": false,
"cache_timeout": null,
"created_on": {
"__datetime__": "2019-04-19T10:28:59"
},
"extra": "{\n \"metadata_params\": {},\n \"engine_params\": {},\n \"metadata_cache_timeout\": {},\n \"schemas_allowed_for_csv_upload\": []\n}\n",
"select_as_create_table_as": false,
"perm": "[main].(id:1)",
"changed_on": {
"__datetime__": "2019-04-19T10:28:59"
},
"expose_in_sqllab": true,
"impersonate_user": false,
"id": 1,
"allow_run_async": false
}
}
}
},
```
## Example Components
A Superset example is a SQL oriented dashboard and is composed of the following:
* Physical database with tables loaded with a dataset
* Superset Database object
* Table object
* Column objects
* Metric objects
* Dashboard object
* One or more Slice objects (charts)
All of the above will need to be serialized, stored, contributed, approved, listed, deserialized and loaded by the example system.
## Scope of Improvement
This proposal improves the superset example process in three areas: example creation, data storage and discoverability.
In order to improve the range and quality of Superset examples we need to first improve the process for creating and loading examples. While examples can be created programmatically, the more natural process is to use Superset to create them. This requires that we automate the process to persist and restore the combined state of the Superset Dashboard, Database and related objects as well as the contents of the datasource itself.
We also need a directory to which examples can be uploaded and a corresponding user interface and process of governance over that repository. This directory should be independent of the Superset project release process and code repository. Current processes for management of Superset’s code assets would transition directly to the management of its examples: changes would be created by creating Github issues and pull requests, data assets would be versioned and managed in a central repository.
Finally we need a user interface for finding, listing and loading examples from the repository. It should be simple and can exist as an `examples` command as part of the superset CLI which will have `create`, `list` , `load` and `remove` sub-commands.
## Example Repository Requirements
The requirement for example storage are that it have the following properties:
* Any user can view a list of all examples that have been published
* Any user can upload an example and publish it - pending approval
* Examples can be approved by committers by some mechanism
* Approved examples can be published/released by committers by some mechanism
* Management process/interface should match as closely as possible existing tools and practices.
* The project is separate from Superset to enable more rapid and independent release cycles
Git and Github are a desirable mechanism for publishing and approval but an undesirable mechanism for storage. Git LFS (Large File System) offers scalable storage while still using Github for project management. With a 2GB file limit and support on Github for 250 of these files, it scales well and is the proposed storage system. Other options are explored in the addendum.
### Example File Format
Examples should be defined and packaged in a standard manner and each example should be self contained in its own file system directory. The existing `Dashboard` export format adequately describes a dashboard, it’s slices and the associated datasources but is missing human readable fields describing the contents of the dataset and dashboard as well as the physical location of the contents of the tables the datasource metadata describes. These fields will be added to the Superset dashboard export format.
Data location information will be stored in a top level `files` key next to the existing `dashboards`, `slices` and `datasources` keys. A top level `description` field will fill out the fields of a description of the dataset in the examples directory. The existing `World’s Bank Data` example is extended below:
```json
{
“dashboards”: [ ... ],
“datasources”: [ ... ],
“description”: {
"created_at": "2019-05-20T16:20:24.883125",
"description": "World Bank Data example about world health populations from 1960-2010.",
"file_count": 1,
"license": "Apache 2.0"
"title": "World Bank Health Information",
"total_rows": 11770,
"total_size": 22561353,
"total_size_mb": 21.52,
},
“files”: [
{
"file_name": "wb_health_population.csv.gz",
"rows": 11770,
"size": 22561353,
"table_name": "wb_health_population",
"types": {
"NY_GNP_PCAP_CD": {
"__NullType__": {
"_has_column_expression": false
}
},
...
},
}
],
“slices”: [ ... ]
}
```
The file layout for this example appears as follows, with the dashboard slug used as the directory name in the exported tarball and examples directory:
```bash
/
/world_health
/world_health/dashboard.json
/world_health/wb_health_population.csv.gz
```
### New or Changed Public Interfaces
Changes to the model classes as well as the CLI have been necessary to improve the examples capability of Superset.
#### UUIDs via sqlalchemy.types.uuid.UUIDType
In order to export or import assets in a way that doesn’t result in integer primary key chaos, we require that each serialized asset have a unique identifier. The `ImportMixin` class has been used to provide a `uuid` field to the following classes:
* `Dashboard`
* `Datasource`
* `Database`
* `DruidCluster`
* `DruidMetric`
* `Slice`
* `SqlMetric`
* `SqlaTable`
* `TableColumn`
* `DashboardEmailSchedule`
* `SliceEmailSchedule`
This required patching FlaskAppBuilder to support UUIDType as a String field type.
#### Command Line Interface
Example capabilities will be accessed via the command line (CLI) interface. The CLI will be changed, removing the `load_examples` command and replacing it with an `examples` subcommand with `create`, `list`, `load` and `remove` commands beneath it.
##### Top Level CLI Menu
`superset —help`
```bash
Usage: superset [OPTIONS] COMMAND [ARGS]...
This is a management script for the Superset application.
Options:
--version Show the flask version
--help Show this message and exit.
Commands:
db Perform database migrations.
examples Manages example Slices/Dashboards/datasets
...
```
##### Dashboard Exports Menu
The dashboard exports menu will be extended to add the `—dashboard-titles/-t`, `—export-data/-x` and `—export-data-dir/-d` options which facilitate example export.
`superset export_dashboards --help`
```bash
Usage: superset export_dashboards [OPTIONS]
Export dashboards to JSON
Options:
-f, --dashboard-file TEXT Specify the the file to export to
-p, --print_stdout Print JSON to stdout
-i, --dashboard-ids INTEGER Specify dashboard id to export
-t, --dashboard-titles TEXT Specify dashboard title to export
-x, --export-data Export the dashboard's data tables as CSV
files.
-d, --export-data-dir TEXT Specify export directory path. Defaults to
'/tmp'.
--help Show this message and exit.
```
##### Examples Top Level Menu
`superset examples —-help`
```bash
Usage: superset examples [OPTIONS] COMMAND [ARGS]...
Manages example Slices/Dashboards/datasets
Options:
--help Show this message and exit.
Commands:
create Create example Slice/Dashboard/datasets
list List example Slices/Dashboards/datasets
load Load an example Slice/Dashboard/dataset
remove Remove an example Slice/Dashboard/dataset
```
##### Example Creation Menu
`superset examples create --help`
```bash
Usage: superset examples create [OPTIONS]
Create example Slice/Dashboard/datasets
Options:
-i, --dashboard-id INTEGER Specify dashboard id to export
-t, --dashboard-title TEXT Specify dashboard title to export
-d, --description TEXT Description of new example [required]
-e, --example-title TEXT Title for new example [required]
-f, --file-name TEXT Specify export file name. Defaults to
dashboard.tar.gz
-l, --license TEXT License of the example dashboard
--help Show this message and exit.
```
##### Examples List Menu
`superset examples list —-help`
```bash
Usage: superset examples list [OPTIONS]
List example Slices/Dashboards/datasets
Options:
-r, --examples-revision TEXT Revision of examples to list
--help Show this message and exit.
```
##### Examples Load Menu
`superset examples load --help`
```bash
Usage: superset examples load [OPTIONS]
Load an example Slice/Dashboard/dataset
Options:
-e, --example-title TEXT Title of example to load [required]
-d, --database-uri TEXT Database URI to load example to
--help Show this message and exit.
```
##### Examples Remove Menu
`superset examples remove --help`
```bash
Usage: superset examples remove [OPTIONS]
Remove an example Slice/Dashboard/dataset
Options:
-e, --example-title TEXT Title of example to remove [required]
-d, --database-uri TEXT Database URI to load example to
--help Show this message and exit.
```
### New Dependencies
Creating, removing, listing and loading examples can be handled without Git LFS but adding examples to the `examples-data` Github repository will require it. This is a developer only requirement of the `examples-data` project and not Superset itself.
Git LFS can be [installed](https://github.com/git-lfs/git-lfs/wiki/Installation) via:
```bash
# OS X
brew install git-lfs
# port install git-lfs
git lfs install
```
```bash
# Ubuntu
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
```
### Migration Plan and Compatibility
Backwards compatibility will be maintained. Existing dashboard export JSON files will continue to work and all existing dashboard examples and their data files will be ported to the new system and stored in the Superset examples repository.
#### Cross-Repository Management
As both the examples and Superset evolve, some will work with newer versions of Superset than others. We must strive to keep all of them up to date, but should also try to make them backwards compatible. It will thus be inevitable that `incubator-superset` releases will have to point to a branch/tag of `superset-examples`.
A given release of superset must reference a certain version of the `examples-data` repository. This is achieved via the `EXAMPLES_GIT_TAG` configuration key. Alternatively, this could be a branch rather than a tag to facilitate the ongoing update of examples.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org