Posted to user@mesos.apache.org by "Hoffman, Steve" <St...@orbitz.com> on 2014/07/01 20:36:23 UTC

Re: Multiple Slaves on Mesos Cluster

Yes, I had to do this as well.  I suspect it has something to do with how the slave tries to figure out the primary IP.
On my machines the adapter is ‘em1’ rather than the typical ‘eth0’, due to an underlying RAID controller.
Could that explain why it ends up on the ‘lo’ interface?
Just a guess…
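
If you want to check, a quick (untested on your boxes) sanity pass on the slave host would be to see what its hostname resolves to and which interface actually carries the routable address, e.g.:

  hostname -f                    # what the slave thinks its name is
  getent hosts $(hostname -f)    # what that name resolves to; Ubuntu often maps it to 127.0.1.1 in /etc/hosts
  ip -4 addr show                # which interface (em1, eth0, ...) holds the 10.x address

If the name resolves to 127.0.1.1, that would line up with the slave(1)@127.0.1.1:5051 showing up in the master log.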

Steve

From: Tomas Barton <ba...@gmail.com>
Reply-To: Mesos Users <us...@mesos.apache.org>
Date: Saturday, June 28, 2014 at 3:57 AM
To: Mesos Users <us...@mesos.apache.org>
Subject: Re: Multiple Slaves on Mesos Cluster

The IP addresses are suspicious: 127.0.1.1:5051 looks like you are communicating with localhost. Make sure you pass the correct `--ip` flag to both the master and the slave; it should be the IP address of the interface that connects these machines.

If the hostname resolves correctly

Name:    hotbox-30.stanford.edu
Address: 10.79.6.70

then use --ip 10.79.6.70
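
For example (just a sketch; binary names and paths depend on how you built or installed Mesos, and MASTER_IP is a placeholder for the master's routable address):

  # on the master host
  mesos-master --ip=MASTER_IP --port=5050 --work_dir=/var/lib/mesos

  # on the slave host (10.79.6.70 from the lookup above)
  mesos-slave --ip=10.79.6.70 --master=MASTER_IP:5050

With explicit --ip flags neither side should fall back to whatever the hostname happens to resolve to.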



On 28 June 2014 02:12, Sammy Steele <sa...@stanford.edu> wrote:
Thanks so much for your help! I am completely new to Mesos, so I am not sure if I am interpreting your question correctly. Do you mean these logs (I've attached pictures of them)? Or do you mean the log generated by the master, which looks like this (after attempting to register one slave on the same computer and one on a different computer):

I0627 17:01:39.385833 32259 replica.cpp:661] Replica learned TRUNCATE action at position 1640
I0627 17:01:46.255574 32260 http.cpp:452] HTTP request for '/master/state.json'
I0627 17:01:55.490910 32255 master.cpp:2477] Re-registering slave 20140627-105920-16777343-5050-32615-0 at slave(1)@127.0.1.1:5051 (hotbox-30.Stanford.EDU)
I0627 17:01:55.491575 32258 registrar.cpp:422] Attempting to update the 'registry'
I0627 17:01:55.493095 32260 log.cpp:680] Attempting to append 322 bytes to the log
I0627 17:01:55.493242 32255 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1641
I0627 17:01:55.493803 32255 replica.cpp:508] Replica received write request for position 1641
I0627 17:01:55.494604 32255 leveldb.cpp:343] Persisting action (342 bytes) to leveldb took 738686ns
I0627 17:01:55.494701 32255 replica.cpp:676] Persisted action at 1641
I0627 17:01:55.495028 32258 replica.cpp:655] Replica received learned notice for position 1641
I0627 17:01:55.495808 32258 leveldb.cpp:343] Persisting action (344 bytes) to leveldb took 699563ns
I0627 17:01:55.495873 32258 replica.cpp:676] Persisted action at 1641
I0627 17:01:55.495931 32258 replica.cpp:661] Replica learned APPEND action at position 1641
I0627 17:01:55.496388 32261 registrar.cpp:479] Successfully updated 'registry'
I0627 17:01:55.496537 32256 log.cpp:699] Attempting to truncate the log to 1641
I0627 17:01:55.496665 32260 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 1642
I0627 17:01:55.496690 32257 master.cpp:2528] Re-registered slave 20140627-105920-16777343-5050-32615-0 at slave(1)@127.0.1.1:5051 (hotbox-30.Stanford.EDU)
I0627 17:01:55.496824 32257 master.cpp:3472] Adding slave 20140627-105920-16777343-5050-32615-0 at slave(1)@127.0.1.1:5051 (hotbox-30.Stanford.EDU) with cpus(*):8; mem(*):15024; disk(*):448079; ports(*):[31000-32000]
I0627 17:01:55.497179 32256 replica.cpp:508] Replica received write request for position 1642
I0627 17:01:55.497406 32257 hierarchical_allocator_process.hpp:444] Added slave 20140627-105920-16777343-5050-32615-0 (hotbox-30.Stanford.EDU) with cpus(*):8; mem(*):15024; disk(*):448079; ports(*):[31000-32000] (and cpus(*):8; mem(*):15024; disk(*):448079; ports(*):[31000-32000] available)
I0627 17:01:55.497931 32256 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 685099ns
I0627 17:01:55.498000 32256 replica.cpp:676] Persisted action at 1642
I0627 17:01:55.498262 32261 replica.cpp:655] Replica received learned notice for position 1642
I0627 17:01:55.499034 32261 leveldb.cpp:343] Persisting action (20 bytes) to leveldb took 676723ns
I0627 17:01:55.499114 32261 leveldb.cpp:401] Deleting ~2 keys from leveldb took 17977ns
I0627 17:01:55.499174 32261 replica.cpp:676] Persisted action at 1642
I0627 17:01:55.499232 32261 replica.cpp:661] Replica learned TRUNCATE action at position 1642
I0627 17:01:56.261155 32261 http.cpp:452] HTTP request for '/master/state.json'
I0627 17:02:06.306571 32257 http.cpp:452] HTTP request for '/master/state.json'
I0627 17:02:16.337009 32255 http.cpp:452] HTTP request for '/master/state.json'
I0627 17:02:26.346575 32256 http.cpp:452] HTTP request for '/master/state.json'
I0627 17:02:36.356895 32257 http.cpp:452] HTTP request for '/master/state.json'

Or this LOG file, which Mesos seems to generate in the working directory?

2014/06/27-17:01:39.350738 7f83eac8f740 Recovering log #81
2014/06/27-17:01:39.350881 7f83eac8f740 Level-0 table #85: started
2014/06/27-17:01:39.353463 7f83eac8f740 Level-0 table #85: 1720 bytes OK
2014/06/27-17:01:39.358567 7f83eac8f740 Delete type=0 #81
2014/06/27-17:01:39.358813 7f83eac8f740 Delete type=3 #78
2014/06/27-17:01:39.359606 7f83e2f48700 Level-0 table #88: started
2014/06/27-17:01:39.359817 7f83e2f48700 Level-0 table #88: 0 bytes OK
2014/06/27-17:01:39.360829 7f83e2f48700 Delete type=0 #86
2014/06/27-17:01:39.361097 7f83e2f48700 Manual compaction at level-0 from (begin) .. (end); will stop at '0000001639' @ 4929 : 1
2014/06/27-17:01:39.361107 7f83e2f48700 Compacting 1@0 + 1@1 files
2014/06/27-17:01:39.363837 7f83e2f48700 Generated table #89: 3 keys, 523 bytes
2014/06/27-17:01:39.363851 7f83e2f48700 Compacted 1@0 + 1@1 files => 523 bytes
2014/06/27-17:01:39.364422 7f83e2f48700 compacted to: files[ 0 1 0 0 0 0 0 ]
2014/06/27-17:01:39.364852 7f83e2f48700 Delete type=2 #85
2014/06/27-17:01:39.365031 7f83e2f48700 Delete type=2 #83
2014/06/27-17:01:39.365305 7f83e2f48700 Manual compaction at level-0 from '0000001639' @ 4929 : 1 .. (end); will stop at (end)



On Fri, Jun 27, 2014 at 4:50 PM, Vinod Kone <vi...@gmail.com> wrote:
It looks like the framework and slave are not able to properly register with the master due to networking issues. There should be log messages indicating whether or not the master received the registration requests.

> "I0627 16:02:42.431401 10059 slave.cpp:2873] Current usage 0.81%. Max allowed age: 6.243193692985590days)", indicating that it has connected to the master.

This just tells you that the slave is running. It has nothing to do with whether or not it is registered with the master.

What do master and framework logs say?
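
For instance (paths are an assumption; substitute wherever your --log_dir points, or look at the master's stderr if you didn't set one), grepping the master log for the registration messages should tell you quickly whether the requests ever arrived:

  grep -i "registering slave" /var/log/mesos/mesos-master.INFO
  grep -i "registering framework" /var/log/mesos/mesos-master.INFO

If nothing shows up there while the slave and framework keep printing their own startup lines, the requests are most likely not reaching the master.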



On Fri, Jun 27, 2014 at 4:36 PM, Sammy Steele <sa...@stanford.edu> wrote:
Hi,

I am trying to set up Mesos with 1 master and 3 slaves running on several computers, all on the same switch. Each computer is a 4-socket, dual-core machine running Ubuntu 13.04. I have installed Mesos and can create one master and one slave when ssh'd into the same computer using the local IP. However, when I try to create a slave on a second computer, connecting via the master's public IP, the slave only appears to register. I get this message:

"I0627 16:02:42.431401 10059 slave.cpp:2873] Current usage 0.81%. Max allowed age: 6.243193692985590days)", indicating that it has connected to the master.

 However, the Mesos tracking website does not recognize the second slave.

Additionally, for frameworks that are started when ssh'd into a different computer than the master, the framework stalls out at:

 "I0627 15:52:44.045642 10254 sched.cpp:230] No credentials provided. Attempting to register without authentication." The task also fails to appear on the mesos tracking website. However, frameworks that are launches from the same computer as the master using the local IP do execute normally.

I am wondering if you have encountered this issue. I have been searching the web for a solution, but have not been able to find one. I would really appreciate any insights you might have.