You are viewing a plain text version of this content. The canonical link for it is here.

Posted to bugs@httpd.apache.org by bu...@apache.org on 2016/10/21 18:32:23 UTC

[Bug 60296] New: RMM list corruption in ldap module results in server hang

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

            Bug ID: 60296
           Summary: RMM list corruption in ldap module results in server
                    hang
           Product: Apache httpd-2
           Version: 2.4.16
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: mod_ldap
          Assignee: bugs@httpd.apache.org
          Reporter: mburke@ea.com

(I originally filed this with Debian's apache team because when I searched for
related information on this issue I turned up something in their bug tracker.
Here is the link for more context:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980)

The anylock structure provided by mod_ldap was set to the type
apr_anylock_none, which resulted in multiple threads mutating the shared RMM
state at the same time without any concurrency guards. The issue I was seeing
was that all the threads on the server were stuck inside the RMM internal
function find_block_of_size every 2 or 3 days, requiring the server processes
to be killed and restarted. 

We are using Apache on Windows.

I made the following patch which passes in a lock to the RMM pool created in
mod_ldap when APR_HAS_THREADS is defined. Since doing so we have not
encountered any hangs:

diff -Naur httpd-2.4.16\include\util_ldap.h httpd-2.4.16-ea\include\util_ldap.h
--- httpd-2.4.16\include\util_ldap.h    Mon Jul 14 05:07:55 2014
+++ httpd-2.4.16-ea\include\util_ldap.h    Mon Aug 29 10:20:08 2016
@@ -169,6 +169,10 @@
 #if APR_HAS_SHARED_MEMORY
     apr_shm_t *cache_shm;
     apr_rmm_t *cache_rmm;
+#if APR_HAS_THREADS
+    apr_thread_mutex_t *lock;
+    apr_anylock_t cache_rmm_anylock;
+#endif
 #endif

     /* cache ald */
diff -Naur httpd-2.4.16\modules\ldap\util_ldap_cache.c
httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c
--- httpd-2.4.16\modules\ldap\util_ldap_cache.c    Mon Aug 19 04:45:19 2013
+++ httpd-2.4.16-ea\modules\ldap\util_ldap_cache.c    Mon Aug 29 10:23:04 2016
@@ -410,6 +410,14 @@
         st->cache_shm = NULL;
         return result;
     }
+
+#if APR_HAS_THREADS
+    apr_thread_mutex_destroy(st->lock);
+    st->lock = NULL;
+    st->cache_rmm_anylock.type = apr_anylock_none;
+    st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
 #endif
     return APR_SUCCESS;
 }
@@ -436,8 +444,18 @@
         /* Determine the usable size of the shm segment. */
         size = apr_shm_size_get(st->cache_shm);

+#if APR_HAS_THREADS
+        apr_thread_mutex_create(&st->lock, APR_THREAD_MUTEX_DEFAULT,
st->pool);
+        st->cache_rmm_anylock.type = apr_anylock_threadmutex;
+        st->cache_rmm_anylock.lock.tm = st->lock;
+#else
+        st->lock = NULL;
+        st->cache_rmm_anylock.type = apr_anylock_none;
+        st->cache_rmm_anylock.lock.pm = NULL;
+#endif
+
         /* This will create a rmm "handler" to get into the shared memory area
*/
-        result = apr_rmm_init(&st->cache_rmm, NULL,
+        result = apr_rmm_init(&st->cache_rmm, &st->cache_rmm_anylock,
                               apr_shm_baseaddr_get(st->cache_shm), size,
                               st->pool);
         if (result != APR_SUCCESS) {


Thanks!

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #15 from Eric Covener <co...@gmail.com> ---
[Tue May 08 08:26:42.418299 2018] [ldap:crit] [pid 68897:tid 700008b10000]
(11)Resource deadlock avoided: [client 127.0.0.1:60404] LDAP cache lock failed

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #7 from Rafael David Tinoco <ra...@canonical.com> ---
Max or Eric,

Are there any changes in status of this bug ? 

I can corroborate to what Max said, and proposed, since I'm analysing a dump
that was brought to me, exactly in this situation (and also could find another
old bug with the same race condition):

https://bz.apache.org/bugzilla/show_bug.cgi?id=58483

These are my notes so far:

Problem summary:

apr_rmm_init acts as a relocatable memory management initialization

it is used in: mod_auth_digest and util_ldap_cache

From the dump was brought to my knowledge, in the following sequence:

- util_ldap_compare_node_copy()
- util_ald_strdup()
- apr_rmm_calloc()
- find_block_of_size()

Had a "cache->rmm_addr" with no lock at "find_block_of_size()"

cache->rmm_addr->lock { type = apr_anylock_none } 

And an invalid "next" offset (out of rmm->base->firstfree).

This rmm_addr was initialized with NULL as a locking mechanism:

From apr-utils:

apr_rmm_init()

    if (!lock) {    <-- 2nd argument to apr_rmm_init()
        nulllock.type = apr_anylock_none;    <--- found in the dump
        nulllock.lock.pm = NULL;
        lock = &nulllock;
    }

From apache:

# mod_auth_digest

    sts = apr_rmm_init(&client_rmm,
                       NULL, /* no lock, we'll do the locking ourselves */
                       apr_shm_baseaddr_get(client_shm),
                       shmem_size, ctx);

# util_ldap_cache

        result = apr_rmm_init(&st->cache_rmm, NULL,
                              apr_shm_baseaddr_get(st->cache_shm), size,
                              st->pool);

It appears that the ldap module chose to use "rmm" for memory allocation, using
the shared memory approach, but without explicitly definiting a lock to it. 
Without it, its up to the caller to guarantee that there are locks for rmm 
synchronization (just like mod_auth_digest does, using global mutexes).

Because of that, there was a race condition in "find_block_of_size" and a call
touching "rmm->base->firstfree", possibly "move_block()", in a multi-threaded 
apache environment, since there were no lock guarantees inside rmm logic (lock
was "apr_anylock_none" and the locking calls don't do anything). 

In find_block_of_size:

    apr_rmm_off_t next = rmm->base->firstfree;

We have:

    rmm->base->firstfree
        Decimal:356400
        Hex:0x57030

But "next" turned into:

Name : next
        Decimal:8320808657351632189
        Hex:0x737973636970653d

Causing:

        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

        if (blk->size == size)

To segfault.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #8 from Eric Covener <co...@gmail.com> ---
I recently re-fixed an issue that causes the lock in mod_ldap to not be used. 
It will be in 2.4.31. It is a one-line change.

http://svn.apache.org/viewvc?view=revision&revision=1824811

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Eric Covener <co...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #1 from Eric Covener <co...@gmail.com> ---
Hi Max, it appears to me that mod_ldap is trying to manage a global (cross
process  + cross thread) lock outside of rmm, which is why it tells RMM to not
use a lock.

Windows is a special case where there is only 1 process and N threads.  But
passing in a thread mutex only is probably wrong outside of windows.

Not having looked too closely, it seems likely that an LDAP_CACHE_LOCK() is
missing somewhere where we call into rmm.

Have you been able to verify on a later release?

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #14 from Eric Covener <co...@gmail.com> ---
In a local test I set: `Mutex fcntl:/tmp/run default`

globally and added an assert to the two lock/unlock macros and got several
aborts.  The asserts with sysvsem never fired.

I will be removing the macros and adding inline funcitons that log and abort on
failure.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Rafael David Tinoco <ra...@canonical.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |patrick.baltz@eh-darmstadt.
                   |                            |de

--- Comment #6 from Rafael David Tinoco <ra...@canonical.com> ---
*** Bug 58483 has been marked as a duplicate of this bug. ***

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Dan Streetman <dd...@ieee.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ddstreet@ieee.org

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #2 from Max Burke <mb...@ea.com> ---
I have not tried a later release in production, no, only because we have been
soaking the change to see if it improved reliability.

I am looking at the mod_ldap changes between our current version (2.4.16) and
the latest (2.4.23) and see that the only changes are whitespace, either
adding/remove spaces or adding newlines before braces, so I would expect that
2.4.23 is similarly affected.

The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation
unit, it does not seem to be available, at present, to util_ldap_cache_mgr.c
where the RMM blocks are allocated and freed.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #4 from Max Burke <mb...@ea.com> ---
(In reply to Eric Covener from comment #3)
> What kind of Require ldap-* do you use? I noticed the debian OP didn't list
> that bit.

I'm unfamiliar with what you mean by "Require ldap-*", but here are the
snippets from our httpd.conf for the ldap settings:

LoadModule authnz_ldap_module modules/mod_authnz_ldap.so
LoadModule ldap_module modules/mod_ldap.so

LDAPTrustedMode SSL
LDAPVerifyServerCert off
LDAPSharedCacheSize 20000000

<AuthnProviderAlias ldap eac>
    AuthLDAPURL "[... snip ... ]"
    AuthLDAPBindDN  "[ ... snip ... ]"

    # including ldap password from a separate file
    # AuthLDAPBindPassword secret
    Include conf/ldap_bind_pwd.conf
</AuthnProviderAlias>

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Guido W. <gu...@mgm-tp.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #5 from Guido W. <gu...@mgm-tp.com> ---
Fixed status - sorry wrong bug.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #16 from Eric Covener <co...@gmail.com> ---
Hi Rafael -- out of curiosity did you ever find how fcntl() came to be used
here?

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #10 from Rafael David Tinoco <ra...@canonical.com> ---
Eric,

Even after your fix...

LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for
subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch
for this bug has been applied,
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683 and the new bug
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1767105, end user still
complains about seg faults and core dumps show same issue: race condition for
rmm->base->firstfree in function "find_block_of_size".

In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while(next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets value 0x5772e56b36226557 because "next" was corrupted (value:
0x57726573553d554f). This happens because the lock 

    APR_ANYLOCK_LOCK(&rmm->lock);

in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.

For the sake of exercising possibilities, if mod_ldap is calling APR RMM with
external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is
this:

Thread #19 7092 (Suspended : Container) 
        kill() at syscall-template.S:84 0x7ff7e9911767  
        <signal handler called>() at 0x7ff7e9cb7390     
        find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a    
        apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68        
        util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d    
        util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784   
        util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310     
        uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3       
        ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937 
        apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa   
        apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01   
        authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca    
        ap_run_auth_checker() at request.c:91 0x56127e692f00    
        ap_process_request_internal() at request.c:335 0x56127e695d57   
        ap_process_async_request() at http_request.c:408 0x56127e6b4690 
        ap_process_request() at http_request.c:445 0x56127e6b4850       
        ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e     
        ap_process_http_connection() at http_core.c:251 0x56127e6b091e  
        ap_run_process_connection() at connection.c:41 0x56127e6a6bf0   
        ap_process_connection() at connection.c:213 0x56127e6a7000      
        process_socket() at worker.c:631 0x7ff7e2f51f8b 
        worker_thread() at worker.c:990 0x7ff7e2f51f8b  
        start_thread() at pthread_create.c:333 0x7ff7e9cad6ba   
        clone() at clone.S:109 0x7ff7e99e341d   

Which means uldap_cache_compare() would have synchronized access to APR RMM
calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the
lock doesn't seem to be acquired.

LDAP_CACHE_LOCK() translates into:

do {
    if (st->util_ldap_cache_lock)
        apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);

After the change proposed for this bug (where "util_ldap_cache_lock" would come
from the ldap_merge_config), it seems that st has util_ldap_cache_lock and
util_ldap_cache all set:

Name : util_ldap_cache_lock
        Hex:0x7ff7ea75aee0

Name : util_ldap_cache
        Hex:0x7ff7e0e51038

Meaning that it got the ldap_cache and ldap_cache_lock from the merge config
function.

From the mutex acquire logic, for the apr_global_mutex_lock() ->
apr_proc_mutex_lock():

apr_status_t apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
{
    return mutex->meth->acquire(mutex);
}

And, from my dump, it would translate into:

st->util_ldap_cache_lock->proc_mutex->meth->acquire ==
proc_mutex_fcntl_acquire()

And from that logic:

static apr_status_t proc_mutex_fcntl_acquire(apr_proc_mutex_t *mutex)
{
    int rc;

    do {
        rc = fcntl(mutex->interproc->filedes, F_SETLKW, &proc_mutex_lock_it);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return errno;
    }
    mutex->curr_locked=1;
    return APR_SUCCESS;
}

Again, from the dump, We would guarantee mutex lock through a file descriptor
to the file:

"/var/lock/apache2/ldap-cache.1368" (filedes == 15)

And the "mutex->curr_locked" would be 1.

Unfortunately, considering my stack trace, during the cache insertion:

find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a    
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68        
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d    
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784   
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310     
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3       

Name : st->util_ldap_cache_lock
        Details:0x7ff7ea75aee0
        Default:0x7ff7ea75aee0
        Decimal:140702767230688
        Hex:0x7ff7ea75aee0
        Binary:11111111111011111101010011101011010111011100000
        Octal:03777375235327340

Name : proc_mutex
        Details:0x7ff7ea75aef8
        Default:0x7ff7ea75aef8
        Decimal:140702767230712
        Hex:0x7ff7ea75aef8
        Binary:11111111111011111101010011101011010111011111000
        Octal:03777375235327370

Name : curr_locked
        Details:0
        Default:0
        Decimal:0
        Hex:0x0
        Binary:0
        Octal:0

I have curr_locked = 0

So far.. I can see:

The functions (from fcntl type of locking) touching curr_locked are:

proc_mutex_fcntl_acquire()    -> fcntl(proc_mutex_lock_it), curr_locked = 1
proc_mutex_fcntl_tryacquire() -> unblocking fcntl(proc_mutex_lock_it),
curr_locked = 1

proc_mutex_fcntl_create()     -> apr_file_open(), curr_locked = 0
proc_mutex_fcntl_release()    -> curr_locked = 0, fcntl(proc_mutex_unlock_it)

proc_mutex_fcntl_cleanup()    -> if (cur_locked = 1) { mutex_fcntl_release() }
-> apr_file_close()

Meaning that likely _release() might have raced with _acquire() somehow. One
thing that comes to my thoughts, right away, is the fact that function
proc_mutex_fcntl_release(), from apr, sets "curr_locked" to 0 before actually
the fcntl() F_SETLKW succeeds BUT the locking is guaranteed using the file lock
with fcntl() call and not due to the variable value change, so it would be
okay-ish. 

I can't see any other obvious reason why the lock is not acquired in this
current execution. Still investigating... any ideas ?

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #12 from Rafael David Tinoco <ra...@canonical.com> ---
Thread #19 7092 (Suspended : Container)
        kill() at syscall-template.S:84 0x7ff7e9911767
        <signal handler called>() at 0x7ff7e9cb7390
        find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
        apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
        util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
        util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
        util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
        uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3
        ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
        apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
        apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
        authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
        ap_run_auth_checker() at request.c:91 0x56127e692f00
        ap_process_request_internal() at request.c:335 0x56127e695d57
        ap_process_async_request() at http_request.c:408 0x56127e6b4690
        ap_process_request() at http_request.c:445 0x56127e6b4850
        ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
        ap_process_http_connection() at http_core.c:251 0x56127e6b091e
        ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
        ap_process_connection() at connection.c:213 0x56127e6a7000
        process_socket() at worker.c:631 0x7ff7e2f51f8b
        worker_thread() at worker.c:990 0x7ff7e2f51f8b
        start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
        clone() at clone.S:109 0x7ff7e99e341d

--------------------------------------------------------------------------------

static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

--------------------------------------------------------------------------------

The following code touches "rmm->base->firstfree" (capable of racing):

apr_rmm_destroy()
 - ...

apr_rmm_init()
 - ...

find_block_of_size()
 - apr_rmm_calloc()
 - apr_rmm_malloc()

move_block()
 - apr_rmm_calloc()
 - apr_rmm_free()
 - apr_rmm_malloc()

--------------------------------------------------------------------------------

All APR calls to RMM have to have: APR_ANYLOCK_LOCK(&rmm->lock) and lock can't
be apr_anylock_none ORELSE the lock has to be guaranteed by the caller, in our
case, uldap_cache_compare() using the LDAP_CACHE_LOCK() directive.

LDAP_CACHE_LOCK() directive is:

        do {
                if (st->util_ldap_cache_lock)
                apr_global_mutex_lock(st->util_ldap_cache_lock);
        } while (0);

Where st is a ldap state struct, got by ap_get_module_config() on the ldap mod.
"util_ldap_cache_lock" is the global apr mutex for this module and it is set
by:

    st->util_ldap_cache_lock = base->util_ldap_cache_lock;

in "util_ldap_merge_config" AND (re)-initialized in util_ldap_post_config if,
still after ldap merge, it does NOT contain "cache_shm" and
APR_HAS_SHARED_MEMOR
was defined (cache settings are inherited in the virtual host, all server use
the same shared memory cache).

If re-initialized by util_ldap_post_config, util_ldap_cache_lock is a mutex
created with ap_global_mutex_create using a "ldap-cache" type.

#if APR_HAS_SHARED_MEMORY
    if (!st->cache_shm) {
#endif

If APR_HAS_SHARED_MEMORY is not set, it always initializes the lock like this.

        result = ap_global_mutex_create(&st->util_ldap_cache_lock, NULL,
                         ldap_cache_mutex_type, NULL, s, p, 0);

Since the util_ldap_cache_lock->proc_mutex->fname is this:

        0x7ff7ea75af38 == "/var/lock/apache2/ldap-cache.1368"

We know for sure that the locking being used is "ldap-cache" and was set like
above since the name is originated from "lock type" string.

--------------------------------------------------------------------------------

We can say for sure that the ldap global lock: st->util_ldap_cache_lock was
created by util_ldap_post_config() function, and the locking method was
obtained
from the "mxcfg_lookup()" logic when doing the ap_global_mutex_create() with
the
type "ldap_cache_mutex_type" ("ldap-cache" string).

ldap_cache_mutex_type variable ("ldap-cache") is a global, configured or used
by
functions coming from the util_ldap_register_hooks:

util_ldap_child_init()                  - n/a
util_ldap_post_config()                 - creates util_ldap_cache_lock based on
it
util_ldap_pre_config()                  - registers the ldap_cache_mutex_type

util_ldap_cache_lock is used for all uldap_cache_XXXXX functions calling
LDAP_CACHE_LOCK() and by the following functions:

util_ldap_child_init()                  - n/a
util_ldap_merge_config()                - set to base (server) config
util_ldap_post_config()                 - created based on
ldap_cache_mutex_type

--------------------------------------------------------------------------------

Inside util_ldap_post_config() -> ap_global_mutex_create() we called the
functio
get_futex_filename(). It checks wether the mutex mech(anism) needs a backing
file or not (only if mech == APR_LOCK_FLOCK or APR_LOCK_FCNTL)

My dump had
st->util_ldap_cache_lock->proc_mutex->fname = /var/lock/apache2/ldap-cache.1368

Meaning that my lock was, for sure, APR_LOCK_FLOCK or, likely, APR_LOCK_FCNTL.

--------------------------------------------------------------------------------

From our compilation flags:

$ apachectl -V

Server version: Apache/2.4.18 (Ubuntu)
Server built:   2018-03-01T18:29:12
Server's Module Magic Number: 20120211:52
Server loaded:  APR 1.5.2, APR-UTIL 1.5.4
Compiled using: APR 1.5.2, APR-UTIL 1.5.4
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

--------------------------------------------------------------------------------

Since we've registered this locking mechanism, ldap_cache_mutex_type, as
"APR_LOCK_DEFAULT", we have to check how the APR_LOCK_DEFAULT logic is defined
inside APR lib.

The full call path is this:

util_ldap_post_config() >>> ap_global_mutex_create() >>>
 apr_global_mutex_create() >>> apr_proc_mutex_create() >>> proc_mutex_create()
 >>> proc_mutex_choose_method()

 And the "mech" used in util_ldap_post_config(), to decide type of mutex, comes
 from the ap_global_mutex_create() function, doing the mxcfg_lookup() with
 the hash type "ldap-cache" (as I showed before, APR_LOCK_DEFAULT).

 Then the proc_mutex_choose_method() function takes its decision on what
locking
 type to use:

    switch (mech) {
    ...
        case APR_LOCK_DEFAULT:
        #if APR_USE_FLOCK_SERIALIZE
                new_mutex->inter_meth = &mutex_flock_methods;
        #elif APR_USE_SYSVSEM_SERIALIZE
                new_mutex->inter_meth = &mutex_sysv_methods;
        #elif APR_USE_FCNTL_SERIALIZE
                new_mutex->inter_meth = &mutex_fcntl_methods;
        #elif APR_USE_PROC_PTHREAD_SERIALIZE
                new_mutex->inter_meth = &mutex_proc_pthread_methods;
        #elif APR_USE_POSIXSEM_SERIALIZE
                new_mutex->inter_meth = &mutex_posixsem_methods;
        #else
                return APR_ENOTIMPL;
        #endif

Meaning that default is set compile time and, in our case, it was FCNTL based,
based on the logic above AND the dump file containing the:

        st->util_ldap_cache_lock->proc_mutex->inter_meth->*_mutex_fcntl_*
pointers

This decision path would have to have APR_USE_FCNTL_SERIALIZE set and this does
not happen for me, since the decision comes from the autoconf script:

# See which lock mechanism we'll select by default on this system.
# The last APR_DECIDE to execute sets the default.
# At this stage, we match the ordering in Apache 1.3
# which is (highest to lowest): sysvsem -> fcntl -> flock.
# POSIX semaphores and cross-process pthread mutexes are not
# used by default since they have less desirable behaviour when
# e.g. a process holding the mutex segfaults.
# The BEOSSEM decision doesn't require any substitutions but is
# included here to prevent the fcntl() branch being selected
# from the decision making.

From configure.in:

case $ac_decision in
    USE_FLOCK_SERIALIZE )
        flockser="1"
        ;;
    USE_FCNTL_SERIALIZE )
        fcntlser="1"
        ;;
    USE_SYSVSEM_SERIALIZE )
        sysvser="1"
        ;;
    USE_POSIXSEM_SERIALIZE )
        posixser="1"
        ;;
    USE_PROC_PTHREAD_SERIALIZE )
        procpthreadser="1"
        ;;
    USE_BEOSSEM )
        beossem="1"
        ;;
esac

And those variables are the ones defining the headers:

$ grep -ri APR_USE_FCNTL_SERIALIZE *
...
debian/build/include/apr.h:#define APR_USE_FCNTL_SERIALIZE           0
include/apr.h.in:#define APR_USE_FCNTL_SERIALIZE           @fcntlser@
...

This is what would make proc_mutex_choose_method() to decide to use fcntl
locking mechanism and you can see that, in my case, fakeroot debian/rules
was build-arch did NOT set that variable. In the configure output, you can see:

decision on apr_lock implementation method... SysV IPC semget()

Which means that the APR_DECIDE was sysvser, like described, as we can see:

$ grep -ri APR_USE_SYSVSEM_SERIALIZE *  | grep -v html
debian/build/include/apr.h:#define APR_USE_SYSVSEM_SERIALIZE         1
include/apr.h.in:#define APR_USE_SYSVSEM_SERIALIZE         @sysvser@

in the build dir.

--------------------------------------------------------------------------------

So, how was fcntl choosen as locking mech(anism) if APR_USE_FCNTL_SERIALIZE
is not set ? The only possibility I see is that in the beginning of
proc_mutex_choose_method, mech == APR_LOCK_FCNTL was passed as an argument,
and NOT APR_LOCK_DEFAULT, like it should have been in the "ldap-cache" type of
mutex, when being created. OR for the APR_LOCK_DEFAULT to be something else.

With that, I went to check what defined it and found apache's function:
ap_set_mutex(), called by apache "init core commands" logic. This function will
set the mechanism for APR_LOCK_DEFAULT depending on the configuration file
stanza (derd!) and with that I went for it and found:

Mutex file:${APACHE_LOCK_DIR} default

--------------------------------------------------------------------------------

SUMMARY:

Now it makes more sense. So the "bug" here now is to find out why fcntl locking
did not guarantee atomicity for the LDAP operations, or even, if it should have
done (based on its premises). 

Another question that raises is, why to use fcntl as a backing mechanism for
the
LDAP locking ? If the lock was supposed to be guaranteed among different nodes,
then backing the lock in a file, IF the name was based also on the instance id,
which in this case is not, it could make sense. But in a shared-threaded only
environment, why ?

TODOs:

(1)

I would like end user to test the LDAP caching operations/locking without fcntl
based locking. According to:

https://httpd.apache.org/docs/2.4/mod/core.html#mutex

It is possible to set a different locking mechanism to each of the subsystems
described in table "Mutex name, Module(s), Protected Resource". This, per se,
will likely guarantee the atomicity of the LDAP cache operations.

(2)

To investigate if there is some catch in regarding to using fcntl() in an
intensive I/O environment for backing up mutexes implementation. Is there any
fix related to fcntl() that should be picked OR any type of I/O configuration
(including page cache, filesystem mounting options, need for direct-io) in
order
for it to be used ?

Eric,

A colleague of mine, or I, will get back to you with some answers on all this
investigation. Tks a lot!

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Guido W. <gu...@mgm-tp.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #11 from Eric Covener <co...@gmail.com> ---
Is fcntl being forced here somehow, or is it the default (apachectl -V|grep
APR_USE) ?  I wonder if an affected user could try a different mechanism, via
e.g. https://httpd.apache.org/docs/2.4/mod/core.html#mutex with "ldap-cache" as
the parm (or if they already are somehow?)

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #9 from Rafael David Tinoco <ra...@canonical.com> ---
Just for a reference on someone that finds this bug. 

Distributions Fixes (Debian & Ubuntu):

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980#15
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683

With LP: #1752683 being worked on (about to suggest final fix for all Ubuntu
versions), it is likely I'll suggest same fix for Bug: #814980, both based in
upstream commit: 39ae6cd64268 brought to me by Eric.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #3 from Eric Covener <co...@gmail.com> ---
(In reply to Max Burke from comment #2)
> I have not tried a later release in production, no, only because we have
> been soaking the change to see if it improved reliability.
> 
> I am looking at the mod_ldap changes between our current version (2.4.16)
> and the latest (2.4.23) and see that the only changes are whitespace, either
> adding/remove spaces or adding newlines before braces, so I would expect
> that 2.4.23 is similarly affected.
> 
> The LDAP_CACHE_LOCK construct is defined only to the util_ldap.c translation
> unit, it does not seem to be available, at present, to util_ldap_cache_mgr.c
> where the RMM blocks are allocated and freed.

It's not directly available, but that lock is meant to be held by anyone who
would ultimately get into the code that does the allocations. That is usually
(AFAICT) via util_ald_cache_insert().

What kind of Require ldap-* do you use? I noticed the debian OP didn't list
that bit.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #17 from Rafael David Tinoco <ra...@canonical.com> ---
Hello Eric,

Sorry for taking so long in providing you feedback, I moved from one company to
another and there is another engineer dealing with this case on the previous
company (and I lost emails from this thread, only noticed it because debian bug
merged this into its bug and I got my name highlighted).

Nevertheless, yes, I discovered - through the dump and execution logic - final
user was enforcing it by configuration file without any particular reason for
it. 

I didn't have the chance to ask final user to remove that and try again (which
I think subsequent communication here might indicate as next steps), but I'm
almost sure that this is indeed the issue.

Tks for providing feedback so fast to all my investigation!! The only thing
here as a *TODO* for upstream work, if I might say that, would be to understand
better why fcntl is the cause for the race (not guaranteeing atomicity). 

Likely a sync and/or I/O barrier issue ? Exactly how you described in your
tests!

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #13 from Eric Covener <co...@gmail.com> ---

> Another question that raises is, why to use fcntl as a backing mechanism for
> the
> LDAP locking ? If the lock was supposed to be guaranteed among different
> nodes,
> then backing the lock in a file, IF the name was based also on the instance
> id,
> which in this case is not, it could make sense. But in a shared-threaded only
> environment, why ?

the locking is only meant to be among the N children of a single instance /
parent process. You should find the PID here is the parent pid. So I think this
aspect is OK.

I have seen fcntl related errors on other OS'es, where APR might use it by
default, related to EDEADLOCK being returned when two unrelated threads use two
unrelated fcntl mutexes.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org

Re: [Bug 60296] RMM list corruption in ldap module results in server hang

Posted by Eric Covener <co...@gmail.com>.

On Fri, Jun 15, 2018 at 10:55 AM Yann Ylavic <yl...@gmail.com> wrote:
>
> On Fri, Jun 15, 2018 at 4:25 PM, Eric Covener <co...@gmail.com> wrote:
> > Should we swap fnctl and flock?
>
> Is flock() any better on all systems?

I am not sure, but it at least does not document EDEADLCK on Linux
like fcntl does.

>
> >
> > Or block file: for 'default' since it would choose multiple fcntls?
>
> Looks sensible, or choose the one or the other depending on which
> works better for a particular system (if we can determine this..).

Good point. The manual currently lists Solaris as problematic but i
think it's too narrow.  Of course mod_ldap was complicit in all of
this with not checking return values or even logging errors.

-- 
Eric Covener
covener@gmail.com

Re: [Bug 60296] RMM list corruption in ldap module results in server hang

Posted by Yann Ylavic <yl...@gmail.com>.

On Fri, Jun 15, 2018 at 4:25 PM, Eric Covener <co...@gmail.com> wrote:
> Should we swap fnctl and flock?

Is flock() any better on all systems?

>
> Or block file: for 'default' since it would choose multiple fcntls?

Looks sensible, or choose the one or the other depending on which
works better for a particular system (if we can determine this..).

Fwd: [Bug 60296] RMM list corruption in ldap module results in server hang

Posted by Eric Covener <co...@gmail.com>.

Should we swap fnctl and flock?

Or block file: for 'default' since it would choose multiple fcntls?

https://bugzilla.mozilla.org/show_bug.cgi?id=62457#c5


#if APR_HAS_FCNTL_SERIALIZE
   else if (!strcasecmp(meth, "fcntl") || !strcasecmp(meth, "file")) {
       *mutexmech = APR_LOCK_FCNTL;
   }
#endif
#if APR_HAS_FLOCK_SERIALIZE
   else if (!strcasecmp(meth, "flock") || !strcasecmp(meth, "file")) {
       *mutexmech = APR_LOCK_FLOCK;
   }
#endif






---------- Forwarded message ---------
From: <bu...@apache.org>
Date: Thu, Jun 14, 2018 at 11:47 AM
Subject: [Bug 60296] RMM list corruption in ldap module results in server hang
To: <bu...@httpd.apache.org>


https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Dan Streetman <dd...@ieee.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #18 from Dan Streetman <dd...@ieee.org> ---
To follow up with results of testing non-file locking: yep, the end user we
were working with initially had this in their config, as Rafael said:

Mutex file:${APACHE_LOCK_DIR} default

and we asked them to update it to not use file for ldap, by changing to:

Mutex file:${APACHE_LOCK_DIR} default
Mutex sem ldap-cache


After that change (done about 1 month ago) they restarted and have not seen the
problem happen again.  I don't have any answer to why exactly they set the
default to file locking.

--
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org



-- 
Eric Covener
covener@gmail.com

[Bug 60296] RMM list corruption in ldap module results in server hang

Posted by bu...@apache.org.

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

Dan Streetman <dd...@ieee.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #18 from Dan Streetman <dd...@ieee.org> ---
To follow up with results of testing non-file locking: yep, the end user we
were working with initially had this in their config, as Rafael said:

Mutex file:${APACHE_LOCK_DIR} default

and we asked them to update it to not use file for ldap, by changing to:

Mutex file:${APACHE_LOCK_DIR} default
Mutex sem ldap-cache


After that change (done about 1 month ago) they restarted and have not seen the
problem happen again.  I don't have any answer to why exactly they set the
default to file locking.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org