Posted to dev@superset.apache.org by gi...@git.apache.org on 2017/08/16 10:01:59 UTC

[GitHub] rhunwicks opened a new issue #3302: Create a PandasDatasource

URL: https://github.com/apache/incubator-superset/issues/3302
 
 
   - [X] I have checked the issue tracker for the same issue and I haven't found one similar
   
   ### Superset version
   
   0.19.0
   
   ### Expected results
   
   There are a large number of Issues asking about adding new Datasources / Connectors:
   
   1. #381
   1. #2790 
   1. #2468
   1. #945 
   1. #241 
   1. #600 
   1. #245 
   
   Unfortunately, I can't find any examples of a working third-party datasource/connector on GitHub, and I think this is because of the complexity and level of effort required to implement a BaseDatasource subclass with all the required methods. In particular, it needs to be able to report the schema and do filtering, grouping and aggregating.
   
   Pandas has great import code, and I have seen Pandas proposed as a method for implementing a CSV connector (see #381): read the CSV using Pandas, output it to SQLite, and then connect to SQLite using the SQLA datasource to create the slices.
   
   This approach could be extended to other data formats that Pandas can read, e.g. Excel, HDF5, etc.
   
   However, it is not ideal, because the SQLite file will potentially be out of date as soon as it is loaded.
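
   To make the approach above concrete, here is a minimal sketch of the Pandas-to-SQLite loading step. The table name and CSV contents are illustrative (a `StringIO` stands in for a real file path so the example is self-contained); the real connector would point the existing SQLA datasource at the resulting database file.

```python
import sqlite3
from io import StringIO

import pandas as pd

# Read CSV data with Pandas, then write it to a SQLite table so the
# existing SQLAlchemy datasource can query it. In-memory DB and
# sample data are placeholders for a real file and upload.
csv_data = StringIO("region,sales\nnorth,10\nsouth,25\nnorth,5\n")
df = pd.read_csv(csv_data)

conn = sqlite3.connect(":memory:")
df.to_sql("uploaded_csv", conn, if_exists="replace", index=False)

# The SQLA datasource would then issue ordinary SQL against the table:
rows = conn.execute(
    "SELECT region, SUM(sales) FROM uploaded_csv GROUP BY region"
).fetchall()
conn.close()
```

   The staleness problem is visible here: the SQLite table is a snapshot of the CSV at load time, and any later change to the source file is invisible until the load is re-run.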
   
   I'd like to propose an alternative: a PandasDatasource that allows the user to specify an import method (`read_csv`, `read_table`, `read_hdf`, etc.) and a URL, and which then queries the URL using that method to create a DataFrame. It reports the available columns and their types based on the DataFrame's dtypes, and by default it allows grouping, filtering and aggregating using Pandas' built-in functionality.
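
   A rough sketch of that idea, under the assumption that a datasource needs to (a) report columns and types and (b) answer a filter/group/aggregate query. All class, method and parameter names here are hypothetical, not Superset's actual connector API:

```python
from io import StringIO

import pandas as pd

class PandasDatasourceSketch:
    """Hypothetical datasource: an import method name plus a URL."""

    def __init__(self, read_method, url, **read_kwargs):
        # e.g. read_method="read_csv" with a URL or file path
        self.reader = getattr(pd, read_method)
        self.url = url
        self.read_kwargs = read_kwargs

    def get_dataframe(self):
        # The issue notes results would be cached anyway; cache the
        # loaded DataFrame so the source is only read once.
        if not hasattr(self, "_df"):
            self._df = self.reader(self.url, **self.read_kwargs)
        return self._df

    def columns(self):
        # Report available columns and their types from the dtypes
        df = self.get_dataframe()
        return {col: str(dtype) for col, dtype in df.dtypes.items()}

    def query(self, filters=None, groupby=None, metrics=None):
        # filters: list of (column, op, value) tuples (illustrative);
        # groupby: list of column names; metrics: {column: agg name}
        df = self.get_dataframe()
        for col, op, val in filters or []:
            if op == "==":
                df = df[df[col] == val]
            elif op == ">":
                df = df[df[col] > val]
        if groupby:
            df = df.groupby(groupby).agg(metrics or {}).reset_index()
        return df

# Usage with inline CSV data standing in for a URL:
ds = PandasDatasourceSketch(
    "read_csv", StringIO("region,sales\nnorth,10\nsouth,25\nnorth,5\n")
)
cols = ds.columns()
result = ds.query(groupby=["region"], metrics={"sales": "sum"})
```

   Because the DataFrame is rebuilt from the URL on each (uncached) load, the data is as fresh as the source, which is exactly what the SQLite snapshot approach loses.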
   
   I realize that this approach won't work for very large datasets that could overwhelm the memory of the server, but it would work for my use case and probably for many others. The results of the read, filter, group and aggregate steps would be cached anyway, so the large memory usage is potentially only temporary.
   
   This would also make it much easier for people working with larger datasets to create a custom connector to suit their purposes. For example, someone wanting to use BigQuery (see #945) could extend the PandasDatasource to use `read_gbq` and pass the filter options through to BigQuery, while still relying on Pandas for grouping and aggregating. Given that starting point, someone else might come along later and add the code necessary to pass some grouping options through to BigQuery as well.
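
   The pushdown part of that extension could be as small as translating the filter options into SQL for the query string passed to `pandas.read_gbq`, with grouping and aggregating still done locally in Pandas. The filter tuple format, helper names and table name below are illustrative assumptions (and the quoting is not injection-safe; a real connector would use query parameters):

```python
def filters_to_where(filters):
    """Translate (column, op, value) filter tuples into a SQL WHERE
    clause. Naive quoting for illustration only."""
    clauses = []
    for col, op, val in filters:
        if isinstance(val, str):
            val = "'{}'".format(val.replace("'", "\\'"))
        clauses.append("{} {} {}".format(col, op, val))
    return " AND ".join(clauses) if clauses else "TRUE"

def build_query(table, filters):
    # The resulting string would be handed to pd.read_gbq(sql, ...),
    # so BigQuery does the filtering before any data reaches Pandas.
    return "SELECT * FROM {} WHERE {}".format(table, filters_to_where(filters))

sql = build_query(
    "project.dataset.sales", [("region", "=", "north"), ("sales", ">", 5)]
)
```

   Grouping would keep working unchanged via the Pandas code path, and could later be pushed down the same way.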
   
   The point is that instead of having to write an entire datasource and implement all of its methods, you could extend an existing one to scratch your particular itch, and over time, as more itches get scratched, we would end up with a much broader selection of datasources for Superset.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services