You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Benno Evers (JIRA)" <ji...@apache.org> on 2019/06/27 11:26:00 UTC

[jira] [Comment Edited] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests

    [ https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873908#comment-16873908 ] 

Benno Evers edited comment on MESOS-9867 at 6/27/19 11:25 AM:
--------------------------------------------------------------

This was fixed upstream with libevent 2.1.8: https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a


was (Author: bennoe):
This was fixed upstream with libevent-2.1.10: https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a

> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---------------------------------------------------------------
>
>                 Key: MESOS-9867
>                 URL: https://issues.apache.org/jira/browse/MESOS-9867
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>              Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity during the OpenSSL handshake and afterwards call the `openssl::verify()` function to perform hostname validation and other checks on the client certificate. If these checks fail, the bufferevent is deleted and the connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
>           if (verify.isError()) {
>             VLOG(1) << "Failed accept, verification error: " << verify.error();
>             request->promise.fail(verify.error());
>             SSL_free(ssl);
>             bufferevent_free(bev);
>             // TODO(jmlvanre): Clean up for readability. Consider RAII
>             // or constructing the impl earlier.
>             CHECK(request->socket >= 0);
>             Try<Nothing> close = os::close(request->socket);
>             if (close.isError()) {
>               LOG(FATAL)
>                 << "Failed to close socket " << stringify(request->socket)
>                 << ": " << close.error();
>             }
>             delete request;
>             return;
>           }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already registered that file descriptor with epoll() to watch for read and write events on that socket. Since the socket is closed, attempts to remove the corresponding fd from the epoll() structs will fail: (See also: https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel object that was associated with fd 9 from the data structure associated with that epoll instance in the kernel. So while we get an error attempting to remove fd 9, there is actually nothing left to remove. However, in a case of epoll failure, libprocess does not adjust the number of readers and writers on that file descriptor:
> {noformat}
>         // evmap.c, evmap_io_del()
>         [...]
>         if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
>                return (-1);
>         [...]
>         ctx->nread = nread;
>         ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file descriptor. That still wouldn't be so bad, however libevent also only adds file descriptors to `epoll()` struct the *first* time we attempt to create a read or write event on that file descriptor:
> {noformat}
>         // evmap.c, evmap_io_add()
>         if (ev->ev_events & EV_READ) {
>                 if (++nread == 1)
>                         res |= EV_READ;
>         }
>         if (ev->ev_events & EV_WRITE) {
>                 if (++nwrite == 1)
>                         res |= EV_WRITE;
>         }
>         [...]
>         if (res) {
>                 [...]
>                 if (evsel->add(base, ev->ev_fd,
>                         old, (ev->ev_events & EV_ET) | res, extra) == -1)
>                         return (-1);
>                 [...]
>         }
> {noformat}
> So when the same file descriptor is attempted to be used again by libevent for epoll() polling, the process will hang because reads or writes to that file descriptor are never noticed.
> This can be reproduced for example by running a test where the `verify()`-callback fails on the server side twice in a row: (note, the LIBPROCESS_IP below is set in order to induce a test failure, result may vary on your local network and ssl configuration)
> {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
> {noformat}
> There is a chance that the issue described here is the same as the ominous "issues" described in MESOS-3008, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)