Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/11/20 21:27:45 UTC

[GitHub] [incubator-superset] willbarrett commented on issue #8574: [SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections

URL: https://github.com/apache/incubator-superset/issues/8574#issuecomment-556419994
 
 
   Yes, there is a challenge around each process getting its own pool. I see your point that for some systems (Redshift is getting my side-eye here) having connections held open could be a larger problem. I'll amend the recommendation above to make one of the options a NullPool. This would retain the existing behavior for databases that are unable to handle enough open connections.
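   For anyone unfamiliar with it, NullPool is SQLAlchemy's no-pooling strategy: every checkout opens a fresh DBAPI connection and every checkin closes it, so nothing is held open between uses. A minimal sketch of opting a database out of pooling (the `sqlite://` URL here is just a stand-in for a real analytics database URI):

```python
# Sketch: disabling pooling for one database via SQLAlchemy's NullPool.
# The in-memory SQLite URL is a stand-in for a real analytics DB URI.
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# With NullPool, each checkout opens a new connection and each checkin
# closes it -- matching the existing pool-free behavior described above.
engine = create_engine("sqlite://", poolclass=NullPool)

with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()
# The underlying DBAPI connection is closed when the block exits.
```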
   
   RE: the cost of spawning/destroying the connection, I think it's impossible to come up with a single solid number. The range likely runs from around 10 milliseconds, when the server is a more application-focused one (Postgres, MySQL, etc.) living on the same network, up to potentially multiple seconds for systems separated geographically or with a chattier protocol for initiating a database connection. Under load, these numbers can grow quite large.
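   Rather than trust any quoted figure, the per-connection overhead for a given setup is easy to measure. A rough stdlib sketch (local SQLite only, so it captures just the process-side cost; a networked database adds at least one round trip, plus TLS and authentication handshakes, on top):

```python
# Sketch: measuring average connection setup/teardown cost locally.
import sqlite3
import time

def avg_connection_cost(n: int = 100) -> float:
    """Average seconds to open and close one SQLite connection.
    A network database would add round-trip and handshake latency."""
    start = time.perf_counter()
    for _ in range(n):
        sqlite3.connect(":memory:").close()
    return (time.perf_counter() - start) / n

print(f"~{avg_connection_cost() * 1000:.3f} ms per connection")
```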
   
   A goal down the line would be to limit total connections, but I'd like to push that off into a future SIP. I believe a reasonable way to attack that would be to implement a custom pool that leverages Redis as a distributed lock for the system. The Redis lookups for this system will potentially add a fair amount of latency, so that's something we should discuss separately in my mind.
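   As an illustration only (the Redis-backed version belongs to that future SIP), the shape of such a limiter within a single process is a bounded semaphore guarding checkouts; a distributed implementation would swap the semaphore for a Redis-held lock or counter, paying a Redis round trip on every acquire:

```python
# Sketch: process-local cap on total connections. A distributed version
# would replace the semaphore with a Redis lock/counter shared across
# workers, at the cost of extra lookup latency per checkout.
import threading
from contextlib import contextmanager

class ConnectionLimiter:
    def __init__(self, max_connections: int):
        self._sem = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection_slot(self, timeout: float = 5.0):
        # Block until a slot frees up, or give up after `timeout`.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("no connection slots available")
        try:
            yield  # open and use the database connection here
        finally:
            self._sem.release()

limiter = ConnectionLimiter(max_connections=2)
with limiter.connection_slot():
    pass  # connect + run query would happen here
```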
   
   RE: the intricacies of SQLAlchemy in a threaded environment, it appears that connection pools and engines are thread-safe, but that sessions and connections are not. This makes sense intellectually - the connection pool and engine are designed to protect the non-thread-safe resources they contain. None of this is safe across a process boundary, so Python's multiprocessing module is a danger to connection pools; we already have this issue with the metadata database. Post-fork, any existing connection pools would need to be recreated. From my research, some database drivers also implement connection calls as blocking operations that hold the Global Interpreter Lock, which will break multithreading. I think that for us to really achieve the best throughput we will want to use lazy-loaded connections from process-based Celery workers that then become long-running connections. This, however, is multiple SIPs away, and I anticipate that we will need to retain the ability to run all queries in the foreground for the foreseeable future.
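   The standard SQLAlchemy remedy for the fork problem is to discard the parent's pooled connections in each child so the pool reconnects lazily with connections the child owns. A sketch, again with SQLite standing in for a real analytics database:

```python
# Sketch: recreating a connection pool after fork. The sqlite:// URL
# is a stand-in for a real analytics database URI.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")

def child_after_fork() -> None:
    # Discard connections inherited from the parent process; the pool
    # lazily opens fresh ones owned by this process on next checkout.
    # (Newer SQLAlchemy versions also offer dispose(close=False) so the
    # child does not close sockets the parent is still using.)
    engine.dispose()

# Register with a worker framework's post-fork hook, or call at the top
# of the child's entry point after multiprocessing/os.fork.
child_after_fork()
with engine.connect() as conn:
    assert conn.execute(text("SELECT 1")).scalar() == 1
```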
   
   RE: is it worth it? I think that depends heavily on the workload. In terms of freeing up processor cycles on the web server, it could be very worth it. If there is a substantial geographical separation between Superset and the database accepting the connection, or if connections are slow to instantiate on that server, it will definitely be very worth it. I think providing the option of connection pooling could greatly accelerate certain workloads, though you have convinced me that retaining the option of a NullPool is a wise choice.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org