Posted to dev@mesos.apache.org by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org> on 2012/03/14 18:36:42 UTC

[jira] [Commented] (MESOS-165) Slaves die after initial registration with master with "Network is unreachable" error

    [ https://issues.apache.org/jira/browse/MESOS-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229412#comment-13229412 ] 

jiraposter@reviews.apache.org commented on MESOS-165:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4355/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Jessica.


Summary
-------

libprocess currently binds to INADDR_ANY and then uses the result of getsockname() as __ip__, overwriting the value it read from LIBPROCESS_IP. With this patch, when LIBPROCESS_IP is supplied and is not 0 (INADDR_ANY), libprocess should use that value for __ip__ instead of the getsockname() result.

I think this bug is the cause of MESOS-165.
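
For illustration only, here is a minimal sketch of the selection logic described above. It is not the actual diff (see the review link for that). The helper name selectAdvertisedIp is hypothetical, and falling back to the getsockname() result when LIBPROCESS_IP is unset, unparseable, or 0.0.0.0 is my assumption about the intended behavior.

```cpp
#include <arpa/inet.h>   // inet_pton, htonl
#include <netinet/in.h>  // INADDR_ANY, in_addr
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of libprocess: pick the address to advertise.
//   configured: raw LIBPROCESS_IP string, or nullptr if the variable is unset
//   bound:      IPv4 address reported by getsockname(), network byte order
// Returns the advertised IPv4 address in network byte order.
uint32_t selectAdvertisedIp(const char* configured, uint32_t bound)
{
  if (configured != nullptr) {
    in_addr parsed;
    // Prefer the configured address, unless it is 0.0.0.0 (INADDR_ANY),
    // which only means "listen on every interface".
    if (inet_pton(AF_INET, configured, &parsed) == 1 &&
        parsed.s_addr != htonl(INADDR_ANY)) {
      return parsed.s_addr;
    }
  }
  // Assumed fallback: keep whatever getsockname() reported.
  return bound;
}
```

A caller would pass getenv("LIBPROCESS_IP") as the first argument and the sin_addr.s_addr filled in by getsockname() as the second.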


This addresses bug MESOS-165.
    https://issues.apache.org/jira/browse/MESOS-165


Diffs
-----

  third_party/libprocess/src/process.cpp 7433be8 

Diff: https://reviews.apache.org/r/4355/diff


Testing
-------


Thanks,

Charles


                
> Slaves die after initial registration with master with "Network is unreachable" error
> -------------------------------------------------------------------------------------
>
>                 Key: MESOS-165
>                 URL: https://issues.apache.org/jira/browse/MESOS-165
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>         Environment: Scientific Linux 6.2 internal cluster
>            Reporter: Jessica J
>            Priority: Blocker
>
> I am using a cluster in which only the master is externally accessible, so when I start the master, I set --ip to one of its internal IP addresses so that it can communicate with its slaves. I have also tried setting this IP address through LIBPROCESS_IP in mesos-env.sh (in the deploy directory), but each time the master starts, it reports that it is running at the external IP address (as if it is ignoring both --ip and LIBPROCESS_IP).
> When I start a slave, I tell it that the master is at an internal IP address (no matter what the master says it's running at), so the initial connection is successful. (I get messages output from both the slave and the master saying the connection was successful.) However, after registering, the slave *immediately* dies. My guess is that upon successful connection, the master tells the slave to communicate with it on the external IP address, but since the slave has no access to the Internet, any further communication fails. 
> The following is the error message the slave gives when it dies:
> F0314 12:25:45.196940 13406 process.cpp:1576] Failed to link, connect: Network is unreachable [101]
> *** Check failure stack trace: ***
>     @     0x7f7d6be3342d  google::LogMessage::Fail()
>     @     0x7f7d6be36ae7  google::LogMessage::SendToLog()
>     @     0x7f7d6be36066  google::LogMessage::Flush()
>     @     0x7f7d6be36279  google::LogMessage::~LogMessage()
>     @     0x7f7d6be39351  google::ErrnoLogMessage::~ErrnoLogMessage()
>     @     0x7f7d6be47319  process::SocketManager::link()
>     @     0x7f7d6be4bc88  process::ProcessManager::link()
>     @     0x7f7d6be4ed98  process::ProcessBase::link()
>     @     0x7f7d6bcaf575  mesos::internal::slave::Slave::newMasterDetected()
>     @     0x7f7d6bcbbd7f  ProtobufProcess<>::handler1<>()
>     @     0x7f7d6bcbe477  ProtobufProcess<>::visit()
>     @     0x7f7d6be504e0  process::MessageEvent::visit()
>     @     0x7f7d6be4b448  process::ProcessManager::resume()
>     @     0x7f7d6be43bae  process::schedule()
>     @     0x7f7d6b5a77f1  start_thread
>     @     0x7f7d6a93c92d  clone
> Aborted
> I have looked at the code (master.cpp, process.cpp, main.cpp, slave.cpp, mesos-master.sh, etc.) and tried to determine why the --ip option is being ignored, but I have thus far been unsuccessful.
