You are viewing a plain text version of this content. The canonical link for it is here.
Posted to bugs@httpd.apache.org by bu...@apache.org on 2018/04/26 12:29:05 UTC
[Bug 60296] RMM list corruption in ldap module results in server hang

https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #10 from Rafael David Tinoco <ra...@canonical.com> ---
Eric,

Even after your fix...

LDAP_CACHE_LOCK() is either missing a barrier or it is not enough for
subsequent calls to APR with NULL locking (passed to APR_RMM_INIT). After patch
for this bug has been applied,
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1752683 and the new bug
https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1767105, end user still
complains about seg faults and core dumps show same issue: race condition for
rmm->base->firstfree in function "find_block_of_size".

In the dump, in find_block_of_size():

    apr_rmm_off_t next = rmm->base->firstfree;
    ...
    while(next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);

blk gets value 0x5772e56b36226557 because "next" was corrupted (value:
0x57726573553d554f). This happens because the lock 

    APR_ANYLOCK_LOCK(&rmm->lock);

in apr_rmm_calloc() is apr_anylock_none, like previously reported by me.

For the sake of exercising possibilities, if mod_ldap is calling APR RMM with
external locking, it would be using LDAP_CACHE_LOCK. My current stack trace is
this:

Thread #19 7092 (Suspended : Container) 
        kill() at syscall-template.S:84 0x7ff7e9911767  
        <signal handler called>() at 0x7ff7e9cb7390     
        find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a    
        apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68        
        util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d    
        util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784   
        util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310     
        uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3       
        ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937 
        apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa   
        apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01   
        authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca    
        ap_run_auth_checker() at request.c:91 0x56127e692f00    
        ap_process_request_internal() at request.c:335 0x56127e695d57   
        ap_process_async_request() at http_request.c:408 0x56127e6b4690 
        ap_process_request() at http_request.c:445 0x56127e6b4850       
        ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e     
        ap_process_http_connection() at http_core.c:251 0x56127e6b091e  
        ap_run_process_connection() at connection.c:41 0x56127e6a6bf0   
        ap_process_connection() at connection.c:213 0x56127e6a7000      
        process_socket() at worker.c:631 0x7ff7e2f51f8b 
        worker_thread() at worker.c:990 0x7ff7e2f51f8b  
        start_thread() at pthread_create.c:333 0x7ff7e9cad6ba   
        clone() at clone.S:109 0x7ff7e99e341d   

Which means uldap_cache_compare() would have synchronized access to APR RMM
calls through LDAP_CACHE_LOCK() macro. This doesn't seem to be the case as the
lock doesn't seem to be acquired.

LDAP_CACHE_LOCK() translates into:

do {
    if (st->util_ldap_cache_lock)
        apr_global_mutex_lock(st->util_ldap_cache_lock);
} while (0);

After the change proposed for this bug (where "util_ldap_cache_lock" would come
from the ldap_merge_config), it seems that st has util_ldap_cache_lock and
util_ldap_cache all set:

Name : util_ldap_cache_lock
        Hex:0x7ff7ea75aee0

Name : util_ldap_cache
        Hex:0x7ff7e0e51038

Meaning that it got the ldap_cache and ldap_cache_lock from the merge config
function.

From the mutex acquire logic, for the apr_global_mutex_lock() ->
apr_proc_mutex_lock():

apr_status_t apr_proc_mutex_lock(apr_proc_mutex_t *mutex)
{
    return mutex->meth->acquire(mutex);
}

And, from my dump, it would translate into:

st->util_ldap_cache_lock->proc_mutex->meth->acquire ==
proc_mutex_fcntl_acquire()

And from that logic:

static apr_status_t proc_mutex_fcntl_acquire(apr_proc_mutex_t *mutex)
{
    int rc;

    do {
        rc = fcntl(mutex->interproc->filedes, F_SETLKW, &proc_mutex_lock_it);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) {
        return errno;
    }
    mutex->curr_locked=1;
    return APR_SUCCESS;
}

Again, from the dump, We would guarantee mutex lock through a file descriptor
to the file:

"/var/lock/apache2/ldap-cache.1368" (filedes == 15)

And the "mutex->curr_locked" would be 1.

Unfortunately, considering my stack trace, during the cache insertion:

find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a    
apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68        
util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d    
util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784   
util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310     
uldap_cache_compare() at util_ldap.c:1,183 0x7ff7e33662d3       

Name : st->util_ldap_cache_lock
        Details:0x7ff7ea75aee0
        Default:0x7ff7ea75aee0
        Decimal:140702767230688
        Hex:0x7ff7ea75aee0
        Binary:11111111111011111101010011101011010111011100000
        Octal:03777375235327340

Name : proc_mutex
        Details:0x7ff7ea75aef8
        Default:0x7ff7ea75aef8
        Decimal:140702767230712
        Hex:0x7ff7ea75aef8
        Binary:11111111111011111101010011101011010111011111000
        Octal:03777375235327370

Name : curr_locked
        Details:0
        Default:0
        Decimal:0
        Hex:0x0
        Binary:0
        Octal:0

I have curr_locked = 0

So far.. I can see:

The functions (from fcntl type of locking) touching curr_locked are:

proc_mutex_fcntl_acquire()    -> fcntl(proc_mutex_lock_it), curr_locked = 1
proc_mutex_fcntl_tryacquire() -> unblocking fcntl(proc_mutex_lock_it),
curr_locked = 1

proc_mutex_fcntl_create()     -> apr_file_open(), curr_locked = 0
proc_mutex_fcntl_release()    -> curr_locked = 0, fcntl(proc_mutex_unlock_it)

proc_mutex_fcntl_cleanup()    -> if (cur_locked = 1) { mutex_fcntl_release() }
-> apr_file_close()

Meaning that likely _release() might have raced with _acquire() somehow. One
thing that comes to my thoughts, right away, is the fact that function
proc_mutex_fcntl_release(), from apr, sets "curr_locked" to 0 before actually
the fcntl() F_SETLKW succeeds BUT the locking is guaranteed using the file lock
with fcntl() call and not due to the variable value change, so it would be
okay-ish. 

I can't see any other obvious reason why the lock is not acquired in this
current execution. Still investigating... any ideas ?

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: bugs-unsubscribe@httpd.apache.org
For additional commands, e-mail: bugs-help@httpd.apache.org