Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/08/19 22:06:56 UTC

[GitHub] [incubator-superset] robdiciuccio opened a new pull request #8069: [SQL Lab] Async query results serialization with MessagePack and PyArrow

URL: https://github.com/apache/incubator-superset/pull/8069
 
 
   ### CATEGORY
   
   Choose one
   
   - [ ] Bug Fix
   - [x] Enhancement (new features, refinement)
   - [ ] Refactor
   - [ ] Add tests
   - [ ] Build / Development Environment
   - [ ] Documentation
   
   ### SUMMARY
   <!--- Describe the change below, including rationale and design decisions -->
   Async query performance in SQL Lab, particularly with large result sets, is fairly poor due to how data is serialized and stored. This PR introduces a `RESULTS_BACKEND_USE_MSGPACK` config option to use [PyArrow](https://arrow.apache.org/docs/python/) to serialize the pandas DataFrame directly, and [MessagePack](https://github.com/msgpack/msgpack-python) for serializing the results payload. Compared to the existing JSON serialization, Arrow and MessagePack provide improved performance and result in much smaller payloads sent to S3 or other cache backends.
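   The two serialization layers can be sketched as follows. This is an illustrative example of the technique the PR describes (Arrow IPC for the DataFrame, MessagePack for the surrounding payload), not the PR's actual implementation; the function names and payload keys are made up for the example.

   ```python
   import msgpack
   import pandas as pd
   import pyarrow as pa


   def serialize_payload(df: pd.DataFrame) -> bytes:
       # Serialize the DataFrame to the Arrow IPC streaming format.
       table = pa.Table.from_pandas(df, preserve_index=False)
       sink = pa.BufferOutputStream()
       with pa.ipc.new_stream(sink, table.schema) as writer:
           writer.write_table(table)
       # Wrap the Arrow bytes plus metadata with MessagePack. The "data"
       # and "row_count" keys are illustrative, not Superset's schema.
       payload = {"data": sink.getvalue().to_pybytes(), "row_count": len(df)}
       return msgpack.packb(payload, use_bin_type=True)


   def deserialize_payload(blob: bytes) -> pd.DataFrame:
       payload = msgpack.unpackb(blob, raw=False)
       reader = pa.ipc.open_stream(payload["data"])
       return reader.read_all().to_pandas()
   ```

   Because Arrow is columnar and binary, the hot path avoids converting every cell to a Python object and back, which is where most of the JSON overhead comes from.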
   
   ### Benchmarks with 100K rows of `birth_names` examples data (multiple runs)
   
   #### JSON
   
   avg serialization: 573ms
   avg deserialization: 200ms
   total: 773ms
   compressed payload size: 816,718 bytes
   peak memory usage: 281.1 MiB
   
   #### Arrow/msgpack
   
   avg serialization: 70ms
   avg deserialization: 40ms
   total: 110ms
   compressed payload size: 452,634 bytes
   peak memory usage: 266.2 MiB
   
   _Benchmarks were performed on a MacBook Pro (2.6 GHz i7, 32 GB RAM) running macOS 10.14.5 and Python 3.6.8. [memory-profiler](https://pypi.org/project/memory-profiler/) was used for memory usage stats._
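   
   A rough micro-benchmark harness in the spirit of the numbers above might look like this. It is a sketch, not the script used for the figures in this PR; absolute timings will vary by machine and dataset.
   
   ```python
   import json
   import time
   
   import msgpack
   import pandas as pd
   import pyarrow as pa
   
   
   def json_ser(df: pd.DataFrame) -> bytes:
       return json.dumps(df.to_dict(orient="records")).encode()
   
   
   def json_des(blob: bytes) -> pd.DataFrame:
       return pd.DataFrame(json.loads(blob))
   
   
   def arrow_ser(df: pd.DataFrame) -> bytes:
       table = pa.Table.from_pandas(df, preserve_index=False)
       sink = pa.BufferOutputStream()
       with pa.ipc.new_stream(sink, table.schema) as writer:
           writer.write_table(table)
       return msgpack.packb(sink.getvalue().to_pybytes(), use_bin_type=True)
   
   
   def arrow_des(blob: bytes) -> pd.DataFrame:
       data = msgpack.unpackb(blob, raw=False)
       return pa.ipc.open_stream(data).read_all().to_pandas()
   
   
   def time_roundtrip(serialize, deserialize, df, runs=5):
       # Average serialization/deserialization time over several runs.
       ser = des = 0.0
       blob = b""
       for _ in range(runs):
           t0 = time.perf_counter()
           blob = serialize(df)
           t1 = time.perf_counter()
           deserialize(blob)
           t2 = time.perf_counter()
           ser += t1 - t0
           des += t2 - t1
       return ser / runs, des / runs, len(blob)
   ```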
   
   ### TEST PLAN
   <!--- What steps should be taken to verify the changes -->
   Testing thus far has been limited to Postgres, mainly on Superset examples data. For full compatibility testing:
   - Enable `RESULTS_BACKEND_USE_MSGPACK = True` in `superset_config.py`
   - Run queries in SQL Lab with various DB backends containing multiple data types
   - Ensure displayed results and CSV downloads contain correctly formatted data
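   
   For the first step, a minimal `superset_config.py` fragment might look like this. `RESULTS_BACKEND_USE_MSGPACK` is the flag this PR introduces; the Redis-backed results cache shown here is just one common choice of results backend, not a requirement of the PR.
   
   ```python
   # superset_config.py -- illustrative fragment
   from cachelib.redis import RedisCache
   
   # A results backend is required for async queries in SQL Lab.
   RESULTS_BACKEND = RedisCache(
       host="localhost", port=6379, key_prefix="superset_results"
   )
   
   # Enable Arrow/MessagePack serialization of query results (this PR).
   RESULTS_BACKEND_USE_MSGPACK = True
   ```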
   
   ### ADDITIONAL INFORMATION
   <!--- Check any relevant boxes with "x" -->
   <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
   - [ ] Has associated issue:
   - [ ] Changes UI
   - [ ] Requires DB Migration.
   - [ ] Confirm DB Migration upgrade and downgrade tested.
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   
   ### REVIEWERS
   @mistercrunch 
