You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/05/21 21:14:47 UTC

[GitHub] [incubator-superset] rjurney opened a new issue #7571: [SIP] Proposal for Improving Examples Interface, Organization and Storage

rjurney opened a new issue #7571: [SIP] Proposal for Improving Examples Interface, Organization and Storage
URL: https://github.com/apache/incubator-superset/issues/7571
 
 
   # [SIP] Proposal for Improving Examples Interface, Organization and Storage
   The goal of the changes in this proposal is to improve the examples capabilities of Superset so as to foster an ecosystem of examples which will sustain and grow as the platform continues to develop. First I will characterize the existing system of examples and then propose changes to improve the number and quality of examples. 
   
   ## Current Examples
   Examples are currently programmatically defined in the `superset.data` module. An abstract interface summarizing these examples looks like the following:
   
   ```python
   class AbstractSupersetExample(ABC):
       """Defines interface through which superset examples load themselves."""
   
       def __init__(self, description):
           self.description = description
   
       def load_data(self):
           # Task 1: Load file and create pandas.DataFrame
           # Task 2: Load data into SQL with pandas.DataFrame.to_sql() 
           # Task 3: Process through ORM to get back workable Table object from whichever data source the table is in
           pass
   
       def create_metrics(self): 
           # Task 1: Build any TableColumns
           # Task 2: Build Metrics - SQLMetrics
           # Task 3: Store metrics in DB via ORM
           pass
       
       def create_charts(self, slices):
           # Task 1: Build Slice from config/JSON
           # Task 2: Store to DB via - misc_dash_slices.add(slc.slice_name) / merge_slice(slc)
           pass
       
       def create_dashboards(self, name, config):
           # Task 1: Instantiate Dash via ORM
           # Task 2: Configure Dash via JSON
           # Task 3: Store to DB via ORM
           pass
   ```
   
   While this mechanism jump started the collection of Superset examples, defining examples as code will not appeal to most Superset users of growing a community that contributes examples.
   
   ### Current Dashboard Export
   
   Dashboard example creation could utilize the export feature by adding example oriented fields to the export. Dashboards can be exported via the Dashboard List interface at `/dashboard/list/` via its Export action or at the command line via  `superset export_dashboards`. Dashboard export JSON includes everything needed to reproduce a dashboard save the actual data table: dashboard, slice and datasource information.
   
   ```json
   {
     "dashboards": [
       {
         "__Dashboard__": {
           "created_on": {
             "__datetime__": "2019-04-19T10:29:21"
           },
           "changed_by_fk": null,
           "slug": "world_health",
           "json_metadata": "{\"remote_id\": 1}",
           "description": null,
           "dashboard_title": "World's Bank Data",
           "changed_on": {
             "__datetime__": "2019-05-02T17:55:47"
           },
           "created_by_fk": null,
           "css": null,
           "position_json": { ... },
     "id": 1,
     "slices": [
       {
         "__Slice__": {
           "created_by_fk": null,
           "cache_timeout": null,
           "params": "{\"compare_lag\": \"10\", \"compare_suffix\": \"o10Y\", \"country_fieldtype\": \"cca3\", \"date_filter\": false, \"entity\": \"country_code\", \"filter_configs\": [{\"asc\": false, \"clearable\": true, \"column\": \"region\", \"key\": \"2s98dfu\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}, {\"asc\": false, \"clearable\": true, \"column\": \"country_name\", \"key\": \"li3j2lk\", \"metric\": \"sum__SP_POP_TOTL\", \"multiple\": true}], \"granularity_sqla\": \"year\", \"groupby\": [], \"limit\": \"25\", \"markup_type\": \"markdown\", \"metric\": \"sum__SP_POP_TOTL\", \"metrics\": [\"sum__SP_POP_TOTL\"], \"row_limit\": 50000, \"secondary_metric\": \"sum__SP_POP_TOTL\", \"show_bubbles\": true, \"since\": \"2014-01-01\", \"time_range\": \"2014-01-01 : 2014-01-02\", \"until\": \"2014-01-02\", \"viz_type\": \"filter_box\", \"where\": \"\", \"remote_id\": 371, \"datasource_name\": \"wb_health_population\", \"schema\": \"wb_health_population\", \"database_name\": \"main\"}",
           "datasource_name": null,
           "datasource_type": "table",
           "slice_name": "Region Filter",
           "changed_on": {
             "__datetime__": "2019-05-02T18:00:39"
           },
           "changed_by_fk": 1,
           "perm": "[main].[wb_health_population](id:2)",
           "description": null,
           "viz_type": "filter_box",
           "datasource_id": 2,
           "id": 371,
           "created_on": {
             "__datetime__": "2019-05-02T17:55:47"
           },
           "owners": []
         }
       }
     ],
     "datasources": [
       {
         "__SqlaTable__": {
           "created_on": {
             "__datetime__": "2019-04-19T10:29:20"
           },
           "sql": null,
           "cache_timeout": null,
           "changed_on": {
             "__datetime__": "2019-05-02T17:55:47"
           },
           "is_sqllab_view": false,
           "params": "{\"remote_id\": 2, \"database_name\": \"main\"}",
           "id": 2,
           "template_params": null,
           "perm": "[main].[wb_health_population](id:2)",
           "description": "<!--\nLicensed to the Apache Software Foundation...",
   		  "created_by_fk": null,
           "table_name": "wb_health_population",
           "default_endpoint": null,
           "changed_by_fk": null,
           "main_dttm_col": "year",
           "is_featured": false,
           "database_id": 1,
           "filter_select_enabled": true,
           "fetch_values_predicate": null,
           "offset": 0,
           "schema": null,
           "columns": [
             {
               "__TableColumn__": {
                 "created_by_fk": null,
                 "python_date_format": null,
                 "is_dttm": false,
                 "description": null,
                 "groupby": false,
                 "type": "FLOAT",
                 "verbose_name": null,
                 "id": 4,
                 "changed_by_fk": null,
                 "created_on": {
                   "__datetime__": "2019-04-19T10:29:20"
                 },
                 "table_id": 2,
                 "filterable": false,
                 "expression": "",
                 "database_expression": null,
                 "is_active": true,
                 "column_name": "NY_GNP_PCAP_CD",
                 "changed_on": {
                   "__datetime__": "2019-04-19T10:29:20"
                 }
               }
             },
   	  ...
   	],
           "metrics": [
             {
               "__SqlMetric__": {
                 "created_by_fk": null,
                 "table_id": 2,
                 "d3format": null,
                 "description": null,
                 "metric_name": "sum__SP_POP_TOTL",
                 "changed_on": {
                   "__datetime__": "2019-04-19T10:29:20"
                 },
                 "metric_type": null,
                 "changed_by_fk": null,
                 "expression": "sum(SP_POP_TOTL)",
                 "warning_text": null,
                 "is_restricted": false,
                 "verbose_name": null,
                 "id": 3,
                 "created_on": {
                   "__datetime__": "2019-04-19T10:29:20"
                 }
               }
             },
   	  ...
           ],
   	"database": {
             "__Database__": {
               "allow_csv_upload": true,
               "verbose_name": null,
               "created_by_fk": null,
               "allow_ctas": false,
               "database_name": "main",
               "changed_by_fk": null,
               "allow_dml": false,
               "sqlalchemy_uri": "sqlite:////Users/rjurney/.superset/superset.db",
               "force_ctas_schema": null,
               "password": null,
               "allow_multi_schema_metadata_fetch": false,
               "cache_timeout": null,
               "created_on": {
                 "__datetime__": "2019-04-19T10:28:59"
               },
               "extra": "{\n    \"metadata_params\": {},\n    \"engine_params\": {},\n    \"metadata_cache_timeout\": {},\n    \"schemas_allowed_for_csv_upload\": []\n}\n",
               "select_as_create_table_as": false,
               "perm": "[main].(id:1)",
               "changed_on": {
                 "__datetime__": "2019-04-19T10:28:59"
               },
               "expose_in_sqllab": true,
               "impersonate_user": false,
               "id": 1,
               "allow_run_async": false
             }
           }
         }
       },
   ```
   
   ## Example Components
   A Superset example is a SQL oriented dashboard and is composed of the following:
   
   * Physical database with tables loaded with a dataset
   * Superset Database object
   * Table object
   * Column objects
   * Metric objects
   * Dashboard object
   * One or more Slice objects (charts)
   
   All of the above will need to be serialized, stored, contributed, approved, listed, deserialized and loaded by the example system.
   
   ## Scope of Improvement
   This proposal improves the superset example process in three areas: example creation, data storage and discoverability.
   
   In order to improve the range and quality of Superset examples we need to first improve the process for creating and loading examples. While examples can be created programmatically, the more natural process is to use Superset to create them. This requires that we automate the process to persist and restore the combined state of the Superset Dashboard, Database and related objects as well as the contents of the datasource itself.
   
   We also need a directory to which examples can be uploaded and a corresponding user interface and process of governance over that repository. This directory should be independent of the Superset project release process and code repository. Current processes for management of Superset’s code assets would transition directly to the management of its examples: changes would be created by creating Github issues and pull requests, data assets would be versioned and managed in a central repository.
   
   Finally we need a user interface for finding, listing and loading examples from the repository. It should be simple and can exist as an `examples` command as part of the superset CLI which will have `create`, `list` , `load` and `remove` sub-commands.
   
   ## Example Repository Requirements
   The requirement for example storage are that it have the following properties:
   
   * Any user can view a list of all examples that have been published
   * Any user can upload an example and publish it - pending approval
   * Examples can be approved by committers by some mechanism
   * Approved examples can be published/released by committers by some mechanism
   * Management process/interface should match as closely as possible existing tools and practices.
   * The project is separate from Superset to enable more rapid and independent release cycles
   
   Git and Github are a desirable mechanism for publishing and approval but an undesirable mechanism for storage. Git LFS (Large File System) offers scalable storage while still using Github for project management. With a 2GB file limit and support on Github for 250 of these files, it scales well and is the proposed storage system. Other options are explored in the addendum.
   
   ### Example File Format
   
   Examples should be defined and packaged in a standard manner and each example should be self contained in its own file system directory. The existing `Dashboard` export format adequately describes a dashboard, it’s slices and the associated datasources but is missing human readable fields describing the contents of the dataset and dashboard as well as the physical location of the contents of the tables the datasource metadata describes. These fields will be added to the Superset dashboard export format.
   
   Data location information will be stored in a top level `files` key next to the existing `dashboards`, `slices` and `datasources` keys. A top level `description` field will fill out the fields of a description of the dataset in the examples directory. The existing `World’s Bank Data` example is extended below:
   
   ```json
   {
       “dashboards”: [ ... ],
       “datasources”: [ ... ],
       “description”: {
           "created_at": "2019-05-20T16:20:24.883125",
           "description": "World Bank Data example about world health populations from 1960-2010.",
           "file_count": 1,
           "license": "Apache 2.0"
           "title": "World Bank Health Information",
           "total_rows": 11770,
           "total_size": 22561353,
           "total_size_mb": 21.52,
       },
       “files”: [ 
           {
               "file_name": "wb_health_population.csv.gz",
               "rows": 11770,
               "size": 22561353,
               "table_name": "wb_health_population",
               "types": {
                   "NY_GNP_PCAP_CD": {
                       "__NullType__": {
                           "_has_column_expression": false
                       }
                   },
                   ...
               },
           }
       ],
       “slices”: [ ... ]
   }
   ```
   
   The file layout for this example appears as follows, with the dashboard slug used as the directory name in the exported tarball and examples directory:
   
   ```bash
   /
   /world_health
   /world_health/dashboard.json
   /world_health/wb_health_population.csv.gz
   ```
   
   ### New or Changed Public Interfaces
   
   Changes to the model classes as well as the CLI have been necessary to improve the examples capability of Superset.
   
   #### UUIDs via sqlalchemy.types.uuid.UUIDType
   
   In order to export or import assets in a way that doesn’t result in integer primary key chaos, we require that each serialized asset have a unique identifier. The `ImportMixin` class has been used to provide a `uuid` field to the following classes:
   
   * `Dashboard`
   * `Datasource`
   * `Database`
   * `DruidCluster`
   * `DruidMetric`
   * `Slice`
   * `SqlMetric`
   * `SqlaTable`
   * `TableColumn`
   * `DashboardEmailSchedule`
   * `SliceEmailSchedule`
   
   This required patching FlaskAppBuilder to support UUIDType as a String field type.
   
   #### Command Line Interface
   
   Example capabilities will be accessed via the command line (CLI) interface. The CLI will be changed, removing the `load_examples` command and replacing it with an `examples` subcommand with `create`,  `list`, `load` and `remove` commands beneath it.
   
   ##### Top Level CLI Menu
   
   `superset —help`
   
   ```bash
   Usage: superset [OPTIONS] COMMAND [ARGS]...
   
     This is a management script for the Superset application.
   
   Options:
     --version  Show the flask version
     --help     Show this message and exit.
   
   Commands:
     db                        Perform database migrations.
     examples                  Manages example Slices/Dashboards/datasets
     ...
   ```
   
   ##### Dashboard Exports Menu
   
   The dashboard exports menu will be extended to add the `—dashboard-titles/-t`, `—export-data/-x` and `—export-data-dir/-d` options which facilitate example export.
   
   `superset export_dashboards --help`
   
   ```bash
   Usage: superset export_dashboards [OPTIONS]
   
     Export dashboards to JSON
   
   Options:
     -f, --dashboard-file TEXT    Specify the the file to export to
     -p, --print_stdout           Print JSON to stdout
     -i, --dashboard-ids INTEGER  Specify dashboard id to export
     -t, --dashboard-titles TEXT  Specify dashboard title to export
     -x, --export-data            Export the dashboard's data tables as CSV
                                  files.
     -d, --export-data-dir TEXT   Specify export directory path. Defaults to
                                  '/tmp'.
     --help                       Show this message and exit.
   ```
   
   ##### Examples Top Level Menu
   
   `superset examples —-help`
   
   ```bash
   Usage: superset examples [OPTIONS] COMMAND [ARGS]...
   
     Manages example Slices/Dashboards/datasets
   
   Options:
     --help  Show this message and exit.
   
   Commands:
     create  Create example Slice/Dashboard/datasets
     list    List example Slices/Dashboards/datasets
     load    Load an example Slice/Dashboard/dataset
     remove  Remove an example Slice/Dashboard/dataset
   ```
   
   ##### Example Creation Menu
   
   `superset examples create --help`
   
   ```bash
   Usage: superset examples create [OPTIONS]
   
     Create example Slice/Dashboard/datasets
   
   Options:
     -i, --dashboard-id INTEGER  Specify dashboard id to export
     -t, --dashboard-title TEXT  Specify dashboard title to export
     -d, --description TEXT      Description of new example  [required]
     -e, --example-title TEXT    Title for new example  [required]
     -f, --file-name TEXT        Specify export file name. Defaults to
                                 dashboard.tar.gz
     -l, --license TEXT          License of the example dashboard
     --help                      Show this message and exit.
   ```
   
   ##### Examples List Menu
   
   `superset examples list —-help`
   
   ```bash
   Usage: superset examples list [OPTIONS]
   
     List example Slices/Dashboards/datasets
   
   Options:
     -r, --examples-revision TEXT  Revision of examples to list
     --help                        Show this message and exit.
   ```
   
   ##### Examples Load Menu
   
   `superset examples load --help`
   
   ```bash
   Usage: superset examples load [OPTIONS]
   
     Load an example Slice/Dashboard/dataset
   
   Options:
     -e, --example-title TEXT  Title of example to load  [required]
     -d, --database-uri TEXT   Database URI to load example to
     --help                    Show this message and exit.
   ```
   
   ##### Examples Remove Menu
   
   `superset examples remove --help`
   
   ```bash
   Usage: superset examples remove [OPTIONS]
   
     Remove an example Slice/Dashboard/dataset
   
   Options:
     -e, --example-title TEXT  Title of example to remove  [required]
     -d, --database-uri TEXT   Database URI to load example to
     --help                    Show this message and exit.
   ```
   
   ### New Dependencies
   
   Creating, removing, listing and loading examples can be handled without Git LFS but adding examples to the `examples-data` Github repository will require it. This is a developer only requirement of the `examples-data` project and not Superset itself. 
   
   Git LFS can be [installed](https://github.com/git-lfs/git-lfs/wiki/Installation) via:
   
   ```bash
   # OS X
   brew install git-lfs
   # port install git-lfs
   git lfs install
   ```
   
   ```bash
   # Ubuntu
   curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
   sudo apt-get install git-lfs
   git lfs install
   ```
   
   ### Migration Plan and Compatibility
   
   Backwards compatibility will be maintained. Existing dashboard export JSON files will continue to work and all existing dashboard examples and their data files will be ported to the new system and stored in the Superset examples repository.
   
   #### Cross-Repository Management
   
   As both the examples and Superset evolve, some will work with newer versions of Superset than others. We must strive to keep all of them up to date, but should also try to make them backwards compatible. It will thus be inevitable that `incubator-superset` releases will have to point to a branch/tag of `superset-examples`. 
   
   A given release of superset must reference a certain version of the `examples-data` repository. This is achieved via the `EXAMPLES_GIT_TAG` configuration key. Alternatively, this could be a branch rather than a tag to facilitate the ongoing update of examples.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org