Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/11/14 23:31:19 UTC

willbarrett opened a new issue #8574: [SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections
URL: https://github.com/apache/incubator-superset/issues/8574
 
 
   ## [SIP] Proposal for Implementing Connection Pooling for Analytics Database Connections
   
   ### Motivation
   
   Currently, Superset’s connections to analytics databases do not have long-lived connection pools. In most instances, a database connection is spawned immediately before a query is executed and discarded after a single use. This introduces a small amount of latency into every query. While most queries run against data warehouses are expected to be longer-running than a typical web application query, this latency will be noticeable when performing operations such as loading schema and table lists for display in the UI, or loading table definitions and previews. 
   A more serious concern is that the number of open connections to analytics databases is bounded only by the number of threads available to the application across all processes. Under peak load, this can lead to hammering databases with a large number of simultaneous connection requests and queries, and it gives us no way to enforce a meaningful upper bound on open connections. Implementing connection pooling at the process level will allow us to provide a configurable maximum number of connections that Superset can open.
   
   ### Proposed Change
   
   I recommend we add a singleton object to hold a SQLAlchemy Engine instance for each configured database in the application. I believe that engines should not be instantiated on startup, but instead instantiated on first use to avoid unnecessary connection negotiation.
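
   As a minimal sketch of what that singleton might look like (names such as `EngineRegistry`, `get_engine`, and `database_id` are illustrative, not existing Superset code):

```python
from threading import Lock

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


class EngineRegistry:
    """Process-level singleton mapping database IDs to SQLAlchemy engines."""

    _engines: dict = {}
    _lock = Lock()

    @classmethod
    def get_engine(cls, database_id: int, url: str, **engine_kwargs) -> Engine:
        # Engines are created lazily on first use, so configured databases
        # that are never queried never pay connection-negotiation costs.
        with cls._lock:
            engine = cls._engines.get(database_id)
            if engine is None:
                engine = create_engine(url, **engine_kwargs)
                cls._engines[database_id] = engine
            return engine
```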
   
   I further recommend that we use the [SQLAlchemy QueuePool](https://docs.sqlalchemy.org/en/13/core/pooling.html#sqlalchemy.pool.QueuePool) as the pool implementation. I would like to make the `pool_size` and `max_overflow` properties configurable, along with whether the queue is treated as FIFO or LIFO, the `pool_pre_ping` option, and the `connect_args` passed on engine instantiation (which control things like connection timeouts). I believe that LIFO queues will be preferable for infrequently-accessed database connections, as they generally maintain fewer connections in the pool, and thus should be the default. For LIFO queues I would also recommend enabling `pool_pre_ping` by default to trigger pool member invalidation when necessary, as stale connections are more likely under the LIFO configuration.
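
   To make those knobs concrete, here is a hedged sketch of how such an engine could be configured; the URI and values are placeholders rather than proposed defaults. Note that `QueuePool` is already SQLAlchemy's default pool class for most URLs, and `pool_use_lifo` requires SQLAlchemy 1.3+:

```python
from sqlalchemy import create_engine

# Placeholder URI and values, for illustration only.
engine = create_engine(
    "postgresql://user:pass@analytics-db/warehouse",
    pool_size=5,           # connections kept open in the pool
    max_overflow=10,       # extra connections allowed beyond pool_size under load
    pool_use_lifo=True,    # LIFO keeps the set of live connections small
    pool_pre_ping=True,    # test connections on checkout; invalidate stale ones
    connect_args={"connect_timeout": 10},  # driver-specific, e.g. psycopg2
)
```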
   
   As part of this work, I recommend moving engine instantiation code out of the Database model and into its own module, probably as part of the singleton that maintains the in-memory mapping of database pools. We will also need to update the code that alters database records so that each process's engine is reinitialized after a Database record is created or updated.
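
   A hypothetical hook for that record-update path might look like the following; `refresh_engine` and its arguments are illustrative names, not existing Superset code:

```python
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


def refresh_engine(engines: dict, database_id: int, new_url: str) -> Engine:
    """Hypothetical hook run after a Database record is created or updated."""
    stale = engines.pop(database_id, None)
    if stale is not None:
        stale.dispose()  # close every pooled connection held by the old engine
    # Recreate eagerly here; lazy recreation on next use would also work.
    engines[database_id] = create_engine(new_url)
    return engines[database_id]
```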
   
   One further change concerns Celery's connection pooling. Right now, we use the NullPool in Celery and instantiate database connections as needed. For Celery, I would recommend moving to the StaticPool, which creates one database connection per worker process. Because Celery reuses worker processes, this will reduce the overhead of background queries. An alternative would be to move to threaded workers (gevent or eventlet) and maintain the same pool configuration as the UI. I'd love suggestions from the community on what to recommend here.
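
   A sketch of the difference, with a placeholder URI:

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool, StaticPool

URI = "postgresql://user:pass@analytics-db/warehouse"  # placeholder

# Current Celery behavior: no pooling, a fresh connection per use.
engine = create_engine(URI, poolclass=NullPool)

# Proposed: StaticPool holds exactly one connection for the lifetime of the
# worker process, which Celery reuses across tasks.
engine = create_engine(URI, poolclass=StaticPool)
```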
   
   ### New or Changed Public Interfaces
   
   This change should have minimal impact on the UI, the primary change being the addition of more configuration options in the Databases section. I would recommend having sensible defaults and hiding the pool setup under an `Advanced` configuration section. I plan to provide guidance on the meaning of the `pool_size`, `max_overflow`, and FIFO vs LIFO configuration parameters, both in the UI and in new documentation.
   
   ### New dependencies
   
   No additional dependencies will be necessary.
   
   ### Migration Plan and Compatibility
   
   A database migration will be necessary to add additional fields for `pool_size`, `max_overflow`, `pool_use_lifo`, `pool_pre_ping`, and `connect_args` to the `dbs` table.
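
   A hypothetical Alembic migration sketch follows; the column types are assumptions (`connect_args` would likely be stored as serialized JSON):

```python
import sqlalchemy as sa
from alembic import op


def upgrade():
    op.add_column("dbs", sa.Column("pool_size", sa.Integer(), nullable=True))
    op.add_column("dbs", sa.Column("max_overflow", sa.Integer(), nullable=True))
    op.add_column("dbs", sa.Column("pool_use_lifo", sa.Boolean(), nullable=True))
    op.add_column("dbs", sa.Column("pool_pre_ping", sa.Boolean(), nullable=True))
    op.add_column("dbs", sa.Column("connect_args", sa.Text(), nullable=True))


def downgrade():
    for column in ("connect_args", "pool_pre_ping", "pool_use_lifo",
                   "max_overflow", "pool_size"):
        op.drop_column("dbs", column)
```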
   
   No URLs will change as part of this work. I would like feedback from the community, particularly engineers at Airbnb, Lyft, and other organizations with large Superset installs, on what sensible defaults for connection pools would look like.
   
   ### Rejected Alternatives
   
   The primary alternative rejected is the current, connection-pool-less state. While this approach opens only the connections needed at any given moment, it falls short on both performance and the predictability of the number of open connections.
   
   I also considered the other connection pool implementations in SQLAlchemy, but it appears that our use case is best served by the QueuePool implementation.

   One additional piece I considered was giving users the option to configure an overall, rather than per-process, maximum number of connections. In that case, processes would need to “check out” the ability to make a connection from a distributed lock built on Redis, or the maximum would need to be large enough to provide at least one connection per live process. While I think this would be a better experience for most users, I’m concerned about the additional application complexity such a change would require. Would processes need to register themselves in Redis on boot so we could get a correct count of the number of live processes? What happens when we need to scale up beyond the global maximum number of database connections? I think solving those problems is not easy, and most use cases will be well enough served by a per-process maximum number of connections.
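
   For illustration, a rough sketch of this rejected global-cap idea using a shared Redis counter; note that it omits exactly the hard parts raised above (slots leaked when a process dies, scaling past the cap):

```python
import redis

r = redis.Redis()   # assumed shared Redis instance
GLOBAL_MAX = 50     # illustrative global connection cap


def try_checkout(database_id: str) -> bool:
    """Atomically claim one global connection slot; back off if over the cap."""
    key = f"conn_count:{database_id}"
    if r.incr(key) > GLOBAL_MAX:
        r.decr(key)  # over the cap: release the slot we just claimed
        return False
    return True


def release(database_id: str) -> None:
    r.decr(f"conn_count:{database_id}")
```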
   
   
