Posted to user@mesos.apache.org by "Buck, Joe" <bu...@amazon.com> on 2014/07/29 02:02:50 UTC

Another question about scheduler / master communication

The earlier email about running a scheduler outside the Mesos cluster reminded me of an issue we encountered last week that I think may bite other Mesos users (and I’m wondering whether checks could be added to make it more readily apparent in the logs). The symptom was a driver node seeing its authentication attempt time out after sending the initial message. The root cause was a bad /etc/hosts entry.

We were testing authentication for our Mesos cluster as part of deploying Mesos 0.19. One of our drivers wasn’t able to connect to the master, and the failure seemed to occur during the handshake. Logs showed the client sending the initial message and the master responding, but nothing past that. Some wireshark-ing showed us that the initial message from the framework had “libprocess/authenticatee(1)@127.0.0.1” in it (which we determined was due to a bad /etc/hosts entry on the driver node). So the Mesos master (which was running on a different host) dutifully replied to that address, which from its point of view was its own loopback interface, and that is where the process came off the rails.
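
As a rough illustration of the diagnosis above, the sketch below pulls the IP address out of a sender string like the one we saw in the capture and reports whether it is a loopback address. The function names and the regular expression are hypothetical, written for this example; the assumed string shape is "id@ip" with an optional ":port" suffix, based only on the one string quoted above.

```python
import ipaddress
import re

# Assumed shape of the sender string: "<id>@<ipv4>" with an optional ":<port>".
# This pattern is a guess for illustration, not a libprocess specification.
_SENDER_RE = re.compile(r"^(?P<id>.+)@(?P<ip>[0-9.]+)(?::(?P<port>\d+))?$")


def advertised_ip(sender):
    """Return the IP address a sender string advertises (hypothetical helper)."""
    match = _SENDER_RE.match(sender.strip())
    if match is None:
        raise ValueError("unrecognized sender string: %r" % sender)
    return ipaddress.ip_address(match.group("ip"))


def advertises_loopback(sender):
    """True if replies to this sender would go to a loopback address."""
    return advertised_ip(sender).is_loopback


if __name__ == "__main__":
    sender = "libprocess/authenticatee(1)@127.0.0.1"
    if advertises_loopback(sender):
        print("sender advertises a loopback address: %s" % advertised_ip(sender))
```

Running a check like this against the captured message would have flagged the problem right away, since a master on another host cannot reach the driver at 127.0.0.1.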

In hindsight, we realized that this Spark warning message spelled out the issue:
"WARN Utils: Your hostname, xxxx resolves to a loopback address: 127.0.0.1; using yyy.yyy.yyy.yyy instead (on interface eth0)"

I was wondering if it would be possible to detect (in the libmesos library) a framework sending out a loopback address, and then either try a different (more sensible) interface (akin to what Spark does) or log very prominently that a loopback address is being sent. This all assumes that the master isn’t also using the loopback interface (I can see that being a valid setup for single-host use).
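
To make the suggestion concrete, here is a minimal sketch of the logging variant of that check, written in Python rather than inside libmesos. It is not how libmesos or Spark implement anything; the function names, the logger name, and the injectable `resolve` parameter are all assumptions made for this example.

```python
import ipaddress
import logging
import socket

log = logging.getLogger("loopback_check")  # hypothetical logger name


def resolves_to_loopback(hostname, resolve=socket.gethostbyname):
    """True if the hostname resolves to a loopback address such as 127.0.0.1.

    `resolve` is injectable so the check can be exercised without touching
    the real /etc/hosts or DNS configuration.
    """
    try:
        addr = resolve(hostname)
    except OSError:
        # Resolution failed outright; that is a different problem.
        return False
    return ipaddress.ip_address(addr).is_loopback


def warn_if_loopback(hostname=None, resolve=socket.gethostbyname):
    """Log a prominent warning when the local hostname maps to loopback."""
    hostname = hostname or socket.gethostname()
    if resolves_to_loopback(hostname, resolve):
        log.warning(
            "Hostname %s resolves to a loopback address; a master on "
            "another host will not be able to reply to this framework.",
            hostname,
        )
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    warn_if_loopback()
```

A driver could call something like `warn_if_loopback()` before registering. As noted above, a real check would also need to stay quiet when the master itself is on the loopback interface in a single-host setup.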

Best Regards,
-Joe Buck