You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/08/14 17:41:00 UTC

[jira] [Updated] (MADLIB-1379) Fix madpack slowness by using a single session

     [ https://issues.apache.org/jira/browse/MADLIB-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1379:
------------------------------------
    Fix Version/s: v1.17

> Fix madpack slowness by using a single session
> ----------------------------------------------
>
>                 Key: MADLIB-1379
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1379
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Madpack
>            Reporter: Domino Valdano
>            Assignee: Domino Valdano
>            Priority: Major
>             Fix For: v1.17
>
>
> In the current implementation of madpack, the python function for sending queries to the database, `run_query` is causing extreme slowness on some platforms, especially during startup and shutdown.
> For example, on a MacBookPro with OSX 16.7.0 and a minimal 3-segment Greenplum demo cluster:
> ```
> > time src/bin/madpack -p greenplum -c /madlib install-check -t nosuchmodule
> madpack.py: INFO : Detected Greenplum DB version 5.10.0.
> real 0m46.365s
> user 0m0.207s
> sys 0m0.136s
> ```
> It takes madpack 46 seconds just to figure out that there is no module named `nosuchmodule` in madlib, so it can't run install-check on it.
> It also takes about 46 seconds to run a single install-check sql file that *does* exist but only contains the line `SELECT 1`.  The general pattern is that `madpack` will hang for a long time, then run `install-check` or `dev-check`, report PASS or FAIL's, then hang for a long time again, and then finally exit.
> This makes it almost unusable for rapid iterative development, for example when writing test files and checking whether they still PASS when minor changes are made.
> I'm not sure how many platforms are affected by this slowdown, but I have noticed it happens on my MacBook Pro's with greenplum but not postgres.
> The main reason for the slowness is that there are a series of setup sql commands issued one by one to `run_query`, and for each command (`DROP USER`, `CREATE USER`, GRANT USAGE`, `GRANT ALL`, etc.).  Instead of using a single connection to the db, each call to `run_query` opens a new connection and then closes it.  In particular, it spawns an entirely new child process, redirects the input and output streams, constructs a `psql` command filled with all the right command-line options, just to execute the single line of sql.  
> On `postgres`, may be a little inefficient, but it doesn't matter too much because there is just a single machine with a single instance running.  Probably slows it down a little, but hardly noticeable. 
> However, on `greenplum` the situation is far worse.  Each time `run_query` creates a new `psql` client session, it first connects to the master postgres instance.  The master spawns a new child postgres process to handle that particular client.  It then signals all of the segments to create new child postgres processes on all the remote hosts to also handle that specific client.  This involves sending encrypted network traffic back and forth between every host in the cluster over an ssh connection, followed by each of those hosts spawning as many new postgres processes as there are segments per host.
> The 46-second delay I've seen is for only a single machine with 1 master and 3 segments, so I think that means we could expect delays of several minutes or more on multi-host clusters.
> I have verified that with greenplum on OSX, each time I open `psql` there is a 3 second delay during the first sql command issued that involves segments, but no delay for any commands that come after that since all of the extra postgres processes have already been created.  Because of how `run_query` is written, that necessitates a `3-second` delay for every SQL command issued, and it issues at least a dozen or so, only one of them being the actual execution of the test script.
> An easy solution to this is to rewrite `run_query` so that it uses a single client connection instead of spawning a new client in a new child process for each query.  This is only a few lines using `psycopg`.  But if we don't want to have that as a requirement, we could have it start running `psql` the first time, but save the input and output file descriptors so that with subsequent queries it can send data back and forth to psql.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)