Posted to dev@httpd.apache.org by Don Poitras <Do...@sas.com> on 2019/08/11 13:52:58 UTC

RE: [PATCH 63503] - Reverse proxy server - SIGSEGV

Hello,
  I see that proxy_util.c was updated a few days ago. Would it be considered a faux pas to contact the developer directly to request a review of this patch? It's been almost 2 months, and I'm worried it will just get lost in the noise.

-- 
Don Poitras - Host R&D    SAS Institute Inc.            SAS Campus Drive 
mailto:sasdtp@sas.com     (919)531-5637  Fax:677-4444   Cary, NC 27513

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
On Sun, Aug 11, 2019 at 9:53 AM Don Poitras <Do...@sas.com> wrote:
>
> Hello,
>   I see that proxy_util.c was updated a few days ago. Would it be considered a faux pas to contact the developer directly to request a review of this patch? It's been almost 2 months, and I'm worried it will just get lost in the noise.

Probably not so effective but following up here is OK.

I am a bit curious if what's really happening on your system is a
failure in an existing mutex that goes unreported, because it seems
that the new use of the thread mutex is redundant due to either the
global mutex that's held or the mutex used internally by
apr_reslist_*().

IIUC the unlocked access to the worker->cp in e.g.
ap_proxy_acquire_connection() is a path only used in single-threaded
MPMs.  If you still have your testbed, it would be interesting to see
which lock is necessary and if any of the existing mutexes are
silently failing.
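
To show what I mean by a silent failure, here is a minimal sketch
(illustrative only, not the actual proxy_util.c code; the helper name is
made up) of logging the lock status instead of discarding it:

    #include "httpd.h"
    #include "http_log.h"
    #include "apr_thread_mutex.h"

    /* Sketch: surface a silently failing lock in the error log instead of
     * quietly proceeding without serialization. */
    static apr_status_t lock_and_report(apr_thread_mutex_t *m, server_rec *s)
    {
        apr_status_t rv = apr_thread_mutex_lock(m);
        if (rv != APR_SUCCESS) {
            ap_log_error(APLOG_MARK, APLOG_ERR, rv, s,
                         "proxy: worker thread mutex lock failed");
        }
        return rv;
    }

If an existing lock is failing like that, the pool would be unprotected
even though the code looks serialized.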

--
Eric Covener
covener@gmail.com

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
On Tue, Aug 13, 2019 at 3:12 PM Don Poitras <Do...@sas.com> wrote:

> You also moved the worker mutex unlock so it only happens if worker->s->hmax != 0

Sorry, that was a bad edit; I've refreshed the patch.

RE: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Don Poitras <Do...@sas.com>.
Eric,
  It's in the works. This involves several other people, as the large server is owned by another group and getting our backend software ready to be called from httpd is somewhat involved. The build needs to be done by someone else as well. I've requested they build with --enable-pool-concurrency-check, as that's what allows me to see which two threads are stepping on each other.
  I looked at your change and I see that you changed the logic so that the worker mutex is set for the first call to ap_proxy_initialize_worker(). I don't think that hurts anything and it probably makes the code easier to read, but I don't think it has an effect, as in the first call (the one that allocates the pool) there is no way for any other thread to be using the pool.
  You also moved the worker mutex unlock so it only happens if worker->s->hmax != 0, which leaves the two calls to connection_constructor() unserialized. I don't have definite proof in my notes that this function must be serialized, so hopefully the test will show whether that is OK. However, even in the case where worker->s->hmax == 0, we need to unlock the worker mutex before returning. If this shows up as a problem in the test, I can make that small change and run again. We should also put back the unlock call in the error path of init_conn_pool(), as we shouldn't leave the mutex locked if we error out of that routine.
  Also, when this is finalized, all the new calls to APLOGNO need to go through your utility to assign message numbers.
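
To make the unlock-on-every-path point concrete, this is the shape I have in mind (a sketch only; the two helper names are placeholders I made up, not real proxy_util.c functions, and the surrounding logic is elided):

    /* Sketch: the worker thread mutex is released on the error path, the
     * hmax path and the non-hmax path alike. */
    apr_status_t rv = apr_thread_mutex_lock(worker->tmutex);
    if (rv != APR_SUCCESS) {
        return rv;
    }
    if (worker->s->hmax) {
        rv = create_worker_reslist(worker);           /* placeholder helper */
        if (rv != APR_SUCCESS) {
            apr_thread_mutex_unlock(worker->tmutex);  /* unlock on error too */
            return rv;
        }
    }
    else {
        rv = construct_single_conn(worker);           /* placeholder helper */
    }
    apr_thread_mutex_unlock(worker->tmutex);          /* both success paths */
    return rv;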

> -----Original Message-----
> From: Eric Covener <co...@gmail.com>
> Sent: Monday, August 12, 2019 3:25 PM
> To: Apache HTTP Server Development List <de...@httpd.apache.org>
> Subject: Re: [PATCH 63503] - Reverse proxy server - SIGSEGV
> 
> Hi Don, can you try this very similar patch? I applied yours to my
> sandbox to get more context and made a few minor changes (including
> pre-existing stuff that looked misleading)
> 
> http://people.apache.org/~covener/trunk-proxy-segv.diff

-- 
Don Poitras - Host R&D    SAS Institute Inc.            SAS Campus Drive 
mailto:sasdtp@sas.com     (919)531-5637  Fax:677-4444   Cary, NC 27513

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Jim Jagielski <ji...@jaguNET.com>.
Via inspection this looks quite sane.

> On Aug 12, 2019, at 3:24 PM, Eric Covener <co...@gmail.com> wrote:
> 
> Hi Don, can you try this very similar patch? I applied yours to my
> sandbox to get more context and made a few minor changes (including
> pre-existing stuff that looked misleading)
> 
> http://people.apache.org/~covener/trunk-proxy-segv.diff
> 
> 


Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
Hi Don, can you try this very similar patch? I applied yours to my
sandbox to get more context and made a few minor changes (including
pre-existing stuff that looked misleading)

http://people.apache.org/~covener/trunk-proxy-segv.diff


On Mon, Aug 12, 2019 at 2:43 PM Eric Covener <co...@gmail.com> wrote:
>
> On Mon, Aug 12, 2019 at 2:37 PM Eric Covener <co...@gmail.com> wrote:
> >
> > On Mon, Aug 12, 2019 at 12:32 PM Don Poitras <Do...@sas.com> wrote:
> > >
> > > Eric,
> > >   The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when proxy_handler() is called from a thread kicked off from a _previous_ call to ap_proxy_initialize_worker(). Turning on the pool concurrency check shows this happening:
> > >
> > > pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use
> > >
> > > 172256 .  Thawed  libc:gsignal (+0x37)                                               <--- abort
> > >   1 libc:gsignal (+0x37)
> > >   2 libc:abort (+0x143)
> > >   3 libapr-1:\apr_pools\pool_concurrency_abort 768
> > >   4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)
> > >   5 libapr-1:\thread_cond\apr_thread_cond_create 44
> > >   6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)
> > >   7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
> > >   8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)
> > >   9 httpd:UNKNOWN at 0x00000000004543A0
> > >
> > > Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:
> > >
> > > 172271    Thawed  libc:__GI_strncmp (+0x1680)                                           <-- in use
> > >   1 libc:__GI_strncmp (+0x1680)
> > >   2 libc:getenv (+0xBD)
> > >   3 libc:__nscd_getai (+0x3D3)
> > >   4 libc:gaih_inet.constprop.8 (+0x15F2)
> > >   5 libc:getaddrinfo (+0x11F)
> > >   6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
> > >   7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
> > >   8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
> > >   9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
> > >  10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
> > >  11 httpd:UNKNOWN at 0x00000000004543A0
> >
> > Thanks, I feel like I understand a bit more now at least about the
> > apr_reslist_acquire() vs. DNS lookups cases.
> >
> > I guess in the above case ap_proxy_initialize_worker() has some funny
> > business where the state is checked outside of the global mutex and,
> > if a 2nd thread sneaks in, it will reuse worker->cp->pool
>
> Whoops, gmail ate a bunch of keystrokes and sent prematurely when I
> tried to paste:
>
>         if (worker->cp == NULL)
>             init_conn_pool(p, worker);
>
> But then it will create a new reslist with the old pool that is already
> being used by thread 1, which has gotten farther into the handler.
> I don't totally follow how the patch in BZ fixes this, if
> that's what's happening.
>
> --
> Eric Covener
> covener@gmail.com



-- 
Eric Covener
covener@gmail.com

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
On Mon, Aug 12, 2019 at 2:37 PM Eric Covener <co...@gmail.com> wrote:
>
> On Mon, Aug 12, 2019 at 12:32 PM Don Poitras <Do...@sas.com> wrote:
> >
> > Eric,
> >   The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when proxy_handler() is called from a thread kicked off from a _previous_ call to ap_proxy_initialize_worker(). Turning on the pool concurrency check shows this happening:
> >
> > pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use
> >
> > 172256 .  Thawed  libc:gsignal (+0x37)                                               <--- abort
> >   1 libc:gsignal (+0x37)
> >   2 libc:abort (+0x143)
> >   3 libapr-1:\apr_pools\pool_concurrency_abort 768
> >   4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)
> >   5 libapr-1:\thread_cond\apr_thread_cond_create 44
> >   6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)
> >   7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
> >   8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)
> >   9 httpd:UNKNOWN at 0x00000000004543A0
> >
> > Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:
> >
> > 172271    Thawed  libc:__GI_strncmp (+0x1680)                                           <-- in use
> >   1 libc:__GI_strncmp (+0x1680)
> >   2 libc:getenv (+0xBD)
> >   3 libc:__nscd_getai (+0x3D3)
> >   4 libc:gaih_inet.constprop.8 (+0x15F2)
> >   5 libc:getaddrinfo (+0x11F)
> >   6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
> >   7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
> >   8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
> >   9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
> >  10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
> >  11 httpd:UNKNOWN at 0x00000000004543A0
>
> Thanks, I feel like I understand a bit more now at least about the
> apr_reslist_acquire() vs. DNS lookups cases.
>
> I guess in the above case ap_proxy_initialize_worker() has some funny
> business where the state is checked outside of the global mutex and,
> if a 2nd thread sneaks in, it will reuse worker->cp->pool

Whoops, gmail ate a bunch of keystrokes and sent prematurely when I
tried to paste:

        if (worker->cp == NULL)
            init_conn_pool(p, worker);

But then it will create a new reslist with the old pool that is already
being used by thread 1, which has gotten farther into the handler.
I don't totally follow how the patch in BZ fixes this, if
that's what's happening.
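
To spell out the window I'm worried about, something like the following shape would close it (a sketch only; worker_is_initialized()/mark_worker_initialized() are placeholders for whatever state check mod_proxy actually uses, proxy_mutex stands for the global mutex, and error handling is elided):

    /* Sketch: re-check the initialization state inside the global mutex so
     * a thread that lost the race does not rebuild the reslist on a pool
     * the winning thread is already allocating from. */
    apr_status_t rv = apr_global_mutex_lock(proxy_mutex);
    if (rv != APR_SUCCESS) {
        return rv;
    }
    if (!worker_is_initialized(worker)) {        /* placeholder check */
        if (worker->cp == NULL) {
            init_conn_pool(p, worker);
        }
        /* ... reslist / single-connection setup on worker->cp->pool ... */
        mark_worker_initialized(worker);         /* placeholder */
    }
    apr_global_mutex_unlock(proxy_mutex);
    return APR_SUCCESS;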

-- 
Eric Covener
covener@gmail.com

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
On Mon, Aug 12, 2019 at 12:32 PM Don Poitras <Do...@sas.com> wrote:
>
> Eric,
>   The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when proxy_handler() is called from a thread kicked off from a _previous_ call to ap_proxy_initialize_worker(). Turning on the pool concurrency check shows this happening:
>
> pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use
>
> 172256 .  Thawed  libc:gsignal (+0x37)                                               <--- abort
>   1 libc:gsignal (+0x37)
>   2 libc:abort (+0x143)
>   3 libapr-1:\apr_pools\pool_concurrency_abort 768
>   4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)
>   5 libapr-1:\thread_cond\apr_thread_cond_create 44
>   6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)
>   7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
>   8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)
>   9 httpd:UNKNOWN at 0x00000000004543A0
>
> Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:
>
> 172271    Thawed  libc:__GI_strncmp (+0x1680)                                           <-- in use
>   1 libc:__GI_strncmp (+0x1680)
>   2 libc:getenv (+0xBD)
>   3 libc:__nscd_getai (+0x3D3)
>   4 libc:gaih_inet.constprop.8 (+0x15F2)
>   5 libc:getaddrinfo (+0x11F)
>   6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
>   7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
>   8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
>   9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
>  10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
>  11 httpd:UNKNOWN at 0x00000000004543A0

Thanks, I feel like I understand a bit more now at least about the
apr_reslist_acquire() vs. DNS lookups cases.

I guess in the above case ap_proxy_initialize_worker() has some funny
business where the state is checked outside of the global mutex and,
if a 2nd thread sneaks in, it will reuse worker->cp->pool

Re: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Eric Covener <co...@gmail.com>.
On Mon, Aug 12, 2019 at 4:33 PM Don Poitras <Do...@sas.com> wrote:
>
> Eric,
>   I'm not sure what you're asking. The global mutex will cover the call to init_conn_pool(), so there isn't a problem with worker->cp getting overlaid if that's what you're thinking.

I am thinking the opposite. Two threads interleaved in
ap_proxy_initialize_worker could both think they're initializing the
worker. When the 2nd one gets blocked halfway through in the global
mutex, the first will have created worker->cp->pool, left, and begun
using it (for the DNS calls).

> The problem is that ap_proxy_initialize_worker() gets called repeatedly with the same worker pointer. The first time it's called for a particular worker, it creates a connection pool and the worker thread mutex.

> But _every_ use of worker->cp->pool must be serialized.

Yes, minus the direct constructor call that only creates a subpool.

> Subsequent calls to ap_proxy_initialize_worker() with the same worker pointer will use the pool and mutex created in the first call. It's only these calls after the first that are at risk for concurrent use of the pool (hence my use of the 'notfirst' flag). In the first call, I don't even bother getting the proxy mutex as there isn't any way for other threads to be using the pool before it's created.

I think the first thread has to grab the thread mutex it just created
if the global lock is not held earlier than it is today (to make
checking the initialization state safer). Another route to safety is
to create the mutex when the worker is defined, since that happens at
startup rather than as a lazy load, but it is probably a riskier
change.
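
For the second route, the idea is just to do the mutex creation once when the worker is defined, roughly like this (a sketch; exactly where this lands in ap_proxy_define_worker() and friends, and which pool to use, would need checking):

    /* Sketch: create the worker thread mutex at definition time (startup),
     * so no request thread ever races on creating it lazily. */
    apr_status_t rv = apr_thread_mutex_create(&worker->tmutex,
                                              APR_THREAD_MUTEX_DEFAULT, p);
    if (rv != APR_SUCCESS) {
        return rv;   /* fail the worker definition rather than run unlocked */
    }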

RE: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Don Poitras <Do...@sas.com>.
Eric,
  I'm not sure what you're asking. The global mutex will cover the call to init_conn_pool(), so there isn't a problem with worker->cp getting overlaid if that's what you're thinking.
  The problem is that ap_proxy_initialize_worker() gets called repeatedly with the same worker pointer. The first time it's called for a particular worker, it creates a connection pool and the worker thread mutex. This mutex appears to only be used in ap_proxy_determine_connection() to protect the setting of worker->cp->addr by apr_sockaddr_info_get() (it also protects use of worker->cp->pool, but it's not clear that was recognized as a requirement). But _every_ use of worker->cp->pool must be serialized.
  Subsequent calls to ap_proxy_initialize_worker() with the same worker pointer will use the pool and mutex created in the first call. It's only these calls after the first that are at risk for concurrent use of the pool (hence my use of the 'notfirst' flag). In the first call, I don't even bother getting the proxy mutex as there isn't any way for other threads to be using the pool before it's created.
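
For reference, the existing serialization in ap_proxy_determine_connection() is roughly this shape (a simplified sketch, not the exact code; hostname, port and err come from the surrounding function):

    /* Simplified sketch: the worker thread mutex guards the write to
     * worker->cp->addr and, with it, the allocation from worker->cp->pool. */
    if ((err = apr_thread_mutex_lock(worker->tmutex)) != APR_SUCCESS) {
        return err;
    }
    err = apr_sockaddr_info_get(&(worker->cp->addr), hostname, APR_UNSPEC,
                                port, 0, worker->cp->pool);
    apr_thread_mutex_unlock(worker->tmutex);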

-- 
Don Poitras - Host R&D    SAS Institute Inc.            SAS Campus Drive 
mailto:sasdtp@sas.com     (919)531-5637  Fax:677-4444   Cary, NC 27513

RE: [PATCH 63503] - Reverse proxy server - SIGSEGV

Posted by Don Poitras <Do...@sas.com>.
Eric,
  The global mutex only serializes concurrent calls to ap_proxy_initialize_worker(). The worker pool is also used when proxy_handler() is called from a thread kicked off from a _previous_ call to ap_proxy_initialize_worker(). Turning on the pool concurrency check shows this happening:

pool concurrency check: pool 0xa04348(proxy_worker_cp), thread cur 7f25a21fc700 in use by 7f2598bed700, state in use -> in use

172256 .  Thawed  libc:gsignal (+0x37)                                               <--- abort
  1 libc:gsignal (+0x37)                                         
  2 libc:abort (+0x143)                                          
  3 libapr-1:\apr_pools\pool_concurrency_abort 768               
  4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)                    
  5 libapr-1:\thread_cond\apr_thread_cond_create 44              
  6 libaprutil-1:\apr_reslist\apr_reslist_create 299 (+0x9)      
  7 mod_proxy:\proxy_util\ap_proxy_initialize_worker 2013 (+0x2F)
  8 mod_proxy:\mod_proxy\proxy_handler 1176 (+0xE)               
  9 httpd:UNKNOWN at 0x00000000004543A0  

Here's the thread the diagnostic said had the pool 'in use' when apr_palloc() was called:

172271    Thawed  libc:__GI_strncmp (+0x1680)                                           <-- in use
  1 libc:__GI_strncmp (+0x1680)
  2 libc:getenv (+0xBD)
  3 libc:__nscd_getai (+0x3D3)
  4 libc:gaih_inet.constprop.8 (+0x15F2)
  5 libc:getaddrinfo (+0x11F) 
  6 libapr-1:\sockaddr\call_resolver 397 (+0x10)
  7 mod_proxy:\proxy_util\ap_proxy_determine_connection 2506 (+0x15)
  8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
  9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
 11 httpd:UNKNOWN at 0x00000000004543A0
                        

Another possible fix would be to use the global mutex everywhere it's needed and remove the worker thread lock (both its creation and its uses) altogether, but that would mean changing the code that currently uses the worker thread lock, and it seemed cleaner to change just proxy_util.c rather than two source modules.
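
Roughly, that alternative would look like this at each site that touches worker->cp->pool (a sketch only; proxy_mutex stands for the existing global mutex, and rv and the sockaddr arguments come from the surrounding code -- note the global mutex is a cross-process lock, so it's a heavier hammer than the per-worker thread mutex):

    /* Sketch of the "global mutex everywhere" alternative: take the same
     * cross-process lock around every allocation from worker->cp->pool. */
    apr_global_mutex_lock(proxy_mutex);
    rv = apr_sockaddr_info_get(&(worker->cp->addr), hostname, APR_UNSPEC,
                               port, 0, worker->cp->pool);
    apr_global_mutex_unlock(proxy_mutex);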

The two functions in ap_proxy_acquire_connection() that I serialized were found from other runs. Both apr_reslist_acquire() and connection_constructor() end up using the same worker pool that is being used by ap_proxy_determine_connection(), e.g.:

pool concurrency check: pool 0x1d93198(proxy_worker_cp), thread cur 7f019bfff700 in use by 7f019abfd700, state in use -> in use

aborting thread:
1    Thread 0x7f019bfff700 (LWP 106622) 0x00007f01a735c207 in raise () from /lib64/libc.so.6
  3 libapr-1:\apr_pools\pool_concurrency_abort 768
  4 libapr-1:\apr_pools\apr_palloc 778 (+0xA)
  5 libaprutil-1:\apr_reslist\get_container 100 (+0x8)                               
  6 libaprutil-1:\apr_reslist\apr_reslist_acquire 121 (+0x3)
  7 mod_proxy:\proxy_util\ap_proxy_acquire_connection 2327 (+0x3)
  8 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1933 (+0x19)
  9 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 10 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)
 11 httpd:UNKNOWN at 0x00000000004543A0

And the "in-use" thread:

5    Thread 0x7f019abfd700 (LWP 106624) 0x00007f01a7424bed in connect () from /lib64/libc.so.6
  8 mod_proxy:\proxy_util\ap_proxy_determine_connection 2537 (+0x15)                 
  9 mod_proxy_http:\mod_proxy_http\proxy_http_handler 1956 (+0x1D)
 10 mod_proxy:\mod_proxy\proxy_run_scheme_handler 3063 (+0x18)
 11 mod_proxy:\mod_proxy\proxy_handler 1250 (+0x16)

Once the workers are up and running, they use thread-specific storage, but all the code that starts them up uses a pool that needs to be serialized.

worker->cp as used in ap_proxy_acquire_connection() is currently not serialized when running with the worker MPM. It can run at the same time as ap_proxy_initialize_worker(), with the same pool used in both.

The problem doesn't show up when running the same test with the prefork or event MPMs.
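
For completeness, the shape of what I serialized in ap_proxy_acquire_connection() is roughly the following (a sketch with error handling and the surrounding logic elided; rv and conn come from the function's locals and arguments). Both branches can allocate from worker->cp->pool, which is why both sit under the lock:

    /* Sketch: both the reslist path and the single-connection path can
     * allocate from worker->cp->pool, so both run under the worker
     * thread mutex. */
    apr_thread_mutex_lock(worker->tmutex);
    if (worker->s->hmax && worker->cp->res) {
        rv = apr_reslist_acquire(worker->cp->res, (void **)conn);
    }
    else {
        if (!worker->cp->conn) {
            rv = connection_constructor((void **)conn, worker,
                                        worker->cp->pool);
        }
        else {
            *conn = worker->cp->conn;
            worker->cp->conn = NULL;
            rv = APR_SUCCESS;
        }
    }
    apr_thread_mutex_unlock(worker->tmutex);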

-- 
Don Poitras - Host R&D    SAS Institute Inc.            SAS Campus Drive 
mailto:sasdtp@sas.com     (919)531-5637  Fax:677-4444   Cary, NC 27513