Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/04/23 23:53:18 UTC

[jira] [Commented] (MESOS-300) Libprocess throws exception in SocketManager::next()

    [ https://issues.apache.org/jira/browse/MESOS-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639684#comment-13639684 ] 

Benjamin Mahler commented on MESOS-300:
---------------------------------------

Vinod and I took a look at this, and we have a hypothesis as to how this could occur:

1. Initially, we have a socket s1 that has outgoing data pending on I/O. This means there is an ev_watcher with an Encoder holding s1. The watcher has yet to call send_data().

2. The socket is closed out-of-band. This clears the socket s1 from the 'sockets' map, and also removes the entry from the 'outgoing' map.

3. Now, a new socket is created, and the kernel re-uses s1 as the descriptor. This creates an entry in the 'sockets' map.

4. Now, the ev_watcher (for the old socket) calls send_data, which calls into SocketManager::next().

5. The if (sockets.count(s) > 0) condition is true; however, the CHECK(outgoing.count(s) > 0) fails because the new socket may not have any outgoing data.
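
To make the sequence above concrete, here is a minimal sketch that replays steps 1 to 5 against a toy model. ToySocketManager, checkWouldFail and replayDescriptorReuse are hypothetical names invented for illustration; only the 'sockets' and 'outgoing' map names and the two conditions come from the code quoted in the issue. It is not the libprocess implementation, and the descriptor number 7 is arbitrary.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>

// Hypothetical stand-in for the two SocketManager maps named above.
struct ToySocketManager {
  std::set<int> sockets;                            // descriptors currently open
  std::map<int, std::queue<std::string>> outgoing;  // pending data per descriptor

  void open(int s) { sockets.insert(s); }
  void send(int s, const std::string& data) { outgoing[s].push(data); }

  // Out-of-band close: drops the descriptor from both maps.
  void close(int s) {
    sockets.erase(s);
    outgoing.erase(s);
  }

  // Mirrors the two conditions in SocketManager::next(): true when the
  // descriptor is known but has no 'outgoing' entry, which is exactly
  // the situation in which the CHECK would fail.
  bool checkWouldFail(int s) const {
    return sockets.count(s) > 0 && outgoing.count(s) == 0;
  }
};

// Replays the hypothesized sequence; 7 is an arbitrary descriptor number.
inline bool replayDescriptorReuse() {
  ToySocketManager manager;
  const int s1 = 7;
  manager.open(s1);
  manager.send(s1, "pending");        // step 1: data queued, watcher not yet run
  manager.close(s1);                  // step 2: closed out-of-band
  manager.open(s1);                   // step 3: kernel hands out the same number
  return manager.checkWouldFail(s1);  // steps 4 and 5: stale watcher calls next()
}
```

In this model replayDescriptorReuse() returns true, which matches the failed check in the reported stack trace.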

A proposed fix is to associate each socket with a UUID. This would allow a check in SocketManager::next() against the UUID rather than the socket descriptor. Checking against the socket descriptor is dangerous as it may be re-used by the kernel!

We could also make use of this UUID in other places where we need to check for the socket being closed out-of-band.
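
A rough sketch of the idea, under stated assumptions: a monotonically increasing 64-bit counter stands in for the UUID (any value that is never re-used would serve the same purpose), and IdentifiedSockets, open, close and isCurrent are hypothetical names, not existing libprocess APIs. A watcher would capture the identifier when it is created and present it later, so a stale watcher no longer matches a descriptor that the kernel has re-used.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical registry that tags every open descriptor with an
// identifier that is never handed out twice.
class IdentifiedSockets {
 public:
  // Registers descriptor 's' and returns the identifier of this
  // particular incarnation of the descriptor.
  std::uint64_t open(int s) {
    const std::uint64_t id = nextId_++;
    sockets_[s] = id;
    return id;
  }

  void close(int s) { sockets_.erase(s); }

  // True only if 's' is open AND is still the incarnation whose
  // identifier the caller captured earlier.
  bool isCurrent(int s, std::uint64_t id) const {
    const auto it = sockets_.find(s);
    return it != sockets_.end() && it->second == id;
  }

 private:
  std::map<int, std::uint64_t> sockets_;
  std::uint64_t nextId_ = 1;  // counter used here in place of a real UUID
};
```

With such a check, a call like isCurrent(s, idCapturedByWatcher) in place of sockets.count(s) > 0 would return false for the old watcher after the descriptor has been closed and re-opened, so the CHECK on 'outgoing' would not be reached for the wrong socket.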
                
> Libprocess throws exception in SocketManager::next()
> ----------------------------------------------------
>
>                 Key: MESOS-300
>                 URL: https://issues.apache.org/jira/browse/MESOS-300
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>            Assignee: Benjamin Hindman
>            Priority: Blocker
>
> Came across this while I was debugging an issue at Twitter.
> I1025 18:34:52.799145 56374 dominant_share_allocator.cpp:417] Performed allocation for 1004 slaves in 337.449 milliseconds
> F1025 18:34:53.633313 56380 process.cpp:1827] Check failed: outgoing.count(s) > 0 
> *** Check failure stack trace: ***
>     @     0x7f68b604f03d  google::LogMessage::Fail()
>     @     0x7f68b6054ca7  google::LogMessage::SendToLog()
>     @     0x7f68b60508ec  google::LogMessage::Flush()
>     @     0x7f68b6050b56  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f68b5f3679c  process::SocketManager::next()
>     @     0x7f68b5f37704  process::send_data()
>     @     0x7f68b60940e3  ev_invoke_pending
>     @     0x7f68b6099518  ev_loop
>     @     0x7f68b5f3332a  process::serve()
>     @     0x7f68b531e73d  start_thread
>     @     0x7f68b4908f6d  clone
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
> Grokking the code, there is a huge comment right above this check stating that we cannot/shouldn't be doing it.
> Encoder* SocketManager::next(int s)
> {
>   HttpProxy* proxy = NULL; // Non-null if needs to be terminated.
>   synchronized (this) {
>     // We cannot assume 'sockets.count(s) > 0' here because it's
>     // possible that 's' has been removed with a call to
>     // SocketManager::close. For example, it could be the case that a
>     // socket has gone to CLOSE_WAIT and the call to 'recv' in
>     // recv_data returned 0 causing SocketManager::close to get
>     // invoked. Later a call to 'send' or 'sendfile' (e.g., in
>     // send_data or send_file) can "succeed" (because the socket is
>     // not "closed" yet because there are still some Socket
>     // references, namely the reference being used in send_data or
>     // send_file!). However, when SocketManager::next is actually
>     // invoked we find out that there is no more data and thus stop
>     // sending.
>     // TODO(benh): Should we actually finish sending the data!?
>     if (sockets.count(s) > 0) {
>       CHECK(outgoing.count(s) > 0);
