Posted to user@mesos.apache.org by Pradeep Kiruvale <pr...@gmail.com> on 2015/10/01 18:55:00 UTC
Running a task in Mesos cluster
Hi All,
I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3 Slaves.
One slave runs on the Master node itself and the other slaves run on different
nodes. Here, node means a physical box.
I first configured a one-node cluster and tested task scheduling using
mesos-execute; that works fine.
When I configure the three-node cluster (1 master and 3 slaves) and look at
the resources on the master (in the GUI), only the Master node's resources are
visible.
The other nodes' resources are not visible. Sometimes they are visible, but in
a deactivated state.
*Please let me know what could be the reason. All the nodes are in the same
network. *
When I try to schedule a task using
/src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
--command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
--resources="cpus(*):3;mem(*):2560"
The tasks always get scheduled on the same node. The resources from the
other nodes are not getting used to schedule the tasks.
*Is it required to register the frameworks from every slave node on the
Master?*
*I have configured this cluster using the code from GitHub.*
Thanks & Regards,
Pradeep
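One way to check which slaves the master currently sees, without relying on the GUI, is to read the master's state document. The sketch below is only an illustration: the `/master/state.json` path is the one that appears in the master log later in this thread, but the JSON field names used here (`slaves`, `hostname`, `pid`, `active`, `resources`) are assumptions about its layout, and `summarize_slaves` and `fetch_state` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def summarize_slaves(state):
    """Return (hostname, pid, active, cpus, mem) for each slave in a state dict."""
    rows = []
    for slave in state.get("slaves", []):
        resources = slave.get("resources", {})
        rows.append((
            slave.get("hostname"),
            slave.get("pid"),
            slave.get("active"),
            resources.get("cpus"),
            resources.get("mem"),
        ))
    return rows


def fetch_state(master="192.168.0.102:5050"):
    """Fetch the master's state document over HTTP (endpoint path is an assumption)."""
    with urlopen("http://%s/master/state.json" % master, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hand-written sample in the assumed layout, so the sketch runs without a cluster.
sample_state = {
    "slaves": [
        {"hostname": "slave1", "pid": "slave(1)@127.0.1.1:5051", "active": False,
         "resources": {"cpus": 8, "mem": 14930}},
    ]
}
for row in summarize_slaves(sample_state):
    print(row)
```

Against a live master, `summarize_slaves(fetch_state())` would list every registered slave; a slave whose `pid` shows a loopback address, or whose `active` flag is false, is a candidate for the problem described above.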
Re: Running a task in Mesos cluster
Posted by Ondrej Smola <on...@gmail.com>.
Yes, there should be configuration options for this in the Mesos configuration;
see the documentation. I am leaving now, so I won't be able to respond until Sunday.
2015-10-03 11:18 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>:
> I have different login names for different system. I have a client system,
> from where I launch the tasks. But these tasks are not getting any
> resources. So, they are not getting scheduled.
>
> I mean to say my cluster arrangement is 1 client, 1 Master, 3 slaves. All
> are different physical systems.
>
> Is there any way of run the tasks under one unified user?
>
> Regards,
> Pradeep
>
> On 3 October 2015 at 10:43, Ondrej Smola <on...@gmail.com> wrote:
>
>>
>> mesos framework receive offers and based on those offers it decides where
>> to run tasks.
>>
>>
>> mesos-execute is little framework that executes your task (hackbench) -
>> see here https://github.com/apache/mesos/blob/master/src/cli/execute.cpp
>>
>> https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L320 you
>> can see that it uses user that run mesos-execute command
>>
>> error you can see should be from here (su command)
>>
>> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/posix/os.hpp#L520
>>
>> under which user do you run mesos-execute and mesos daemons?
>>
>> 2015-10-02 15:26 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>:
>>
>>> Hi Ondrej,
>>>
>>> Thanks for your reply
>>>
>>> I did solve that issue, yes you are right there was an issue with slave
>>> IP address setting.
>>>
>>> Now I am facing issue with the scheduling the tasks. When I try to
>>> schedule a task using
>>>
>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>> --resources="cpus(*):3;mem(*):2560"
>>>
>>> The tasks always get scheduled on the same node. The resources from the
>>> other nodes are not getting used to schedule the tasks.
>>>
>>> I just start the mesos slaves like below
>>>
>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>> --hostname=slave1
>>>
>>> If I submit the task using the above (mesos-execute) command from one of
>>> the slaves, it runs on that system.
>>>
>>> But when I submit the task from a different system, it uses just that
>>> system and queues the tasks; they do not run on the other slaves.
>>> Sometimes I see the message "Failed to getgid: unknown user"
>>>
>>> Do I need to start some process to push the task on all the slaves
>>> equally? Am I missing something here?
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>>
>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> The problem is with the IP your slave advertises. By default, Mesos
>>>> resolves your hostname. There are several solutions (let's say your node
>>>> IP is 192.168.56.128):
>>>>
>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>> 2) set the Mesos options ip and hostname
>>>>
>>>> One way to do this is to create these files:
>>>>
>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>
>>>> for more configuration options see
>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>
>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>
>>>> :
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> Thanks for reply. I found one interesting log message.
>>>>>
>>>>> 7410 master.cpp:5977] Removed slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>> registered at the same address
>>>>>
>>>>> Most likely because of this issue, the slave nodes keep getting
>>>>> registered and de-registered to make room for the next node. I can even
>>>>> see this in the UI: one node is added for some time, and after a while
>>>>> it is replaced by the new slave node.
>>>>>
>>>>> The above log is followed by the below log messages.
>>>>>
>>>>>
>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>> bytes) to leveldb took 104089ns
>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 15: Transport endpoint is not connected
>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>>> ports(*):[31000-32000]
>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116) disconnected
>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116)
>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 16: Transport endpoint is not connected
>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116)
>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>>> notice for position 384
>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>> bytes) to leveldb took 95171ns
>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>> leveldb took 20333ns
>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Pradeep
>>>>>
>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> Please check some of my questions in line.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3
>>>>>>> Slaves.
>>>>>>>
>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>
>>>>>>> I tried running the tasks by configuring one Node cluster. Tested
>>>>>>> the task scheduling using mesos-execute, works fine.
>>>>>>>
>>>>>>> When I configure three Node cluster (1master and 3 slaves) and try
>>>>>>> to see the resources on the master (in GUI) only the Master node resources
>>>>>>> are visible.
>>>>>>> The other nodes resources are not visible. Some times visible but
>>>>>>> in a de-actived state.
>>>>>>>
>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>> There should be some logs in either master or slave telling you what is
>>>>>> wrong.
>>>>>>
>>>>>>>
>>>>>>> *Please let me know what could be the reason. All the nodes are in
>>>>>>> the same network. *
>>>>>>>
>>>>>>> When I try to schedule a task using
>>>>>>>
>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>
>>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>
>>>>>> Based on your previous question, there is only one node in your
>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>> what is wrong with other three nodes first.
>>>>>>
>>>>>>>
>>>>>>> I*s it required to register the frameworks from every slave node on
>>>>>>> the Master?*
>>>>>>>
>>>>>> It is not required.
>>>>>>
>>>>>>>
>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
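The diagnosis quoted in this thread (a slave advertises the address its hostname resolves to, and the master log shows slaves registering at 127.0.1.1) can be checked on each node before starting the slave. The following is a minimal sketch of that check, not Mesos' own lookup code; `is_loopback`, `resolved_address` and `check_advertised` are hypothetical helper names, and `socket.gethostbyname(socket.gethostname())` is only a rough stand-in for what the daemon does by default.

```python
import ipaddress
import socket


def is_loopback(address):
    """True if the textual IP address lies in a loopback range (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def resolved_address():
    """Resolve this machine's own hostname (an approximation of the default lookup)."""
    return socket.gethostbyname(socket.gethostname())


def check_advertised(address):
    """Describe whether other nodes could plausibly reach the given address."""
    if is_loopback(address):
        return "%s is loopback: other nodes cannot reach it; set LIBPROCESS_IP or the ip option" % address
    return "%s looks routable" % address


# The two addresses that appear in the master log quoted in this thread.
print(check_advertised("127.0.1.1"))
print(check_advertised("192.168.0.116"))
```

On a real node, `check_advertised(resolved_address())` reports what the local hostname resolves to. A loopback result would be consistent with the "a new slave registered at the same address" message, since every slave would then claim the same 127.0.1.1:5051 endpoint.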
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
I have different login names on the different systems. I have a client system
from which I launch the tasks, but these tasks are not getting any
resources, so they are not getting scheduled.
To be clear, my cluster arrangement is 1 client, 1 Master and 3 slaves. All
are different physical systems.
Is there any way to run the tasks under one unified user?
Regards,
Pradeep
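Since the login names differ between machines, a quick test on each slave is whether the account used on the client exists there at all. This would be consistent with the "Failed to getgid: unknown user" message reported earlier in the thread, although the thread does not confirm that this is the cause. The sketch below uses Python's standard `pwd` module; `user_exists` and `missing_users` are hypothetical helper names.

```python
import pwd


def user_exists(name):
    """True if `name` has an entry in this machine's user database."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def missing_users(names):
    """Return the subset of `names` that are unknown on this machine."""
    return [name for name in names if not user_exists(name)]


# "root" is present on typical Linux systems; the second name is made up.
print(missing_users(["root", "no_such_user_for_mesos_test"]))
```

Running this on every slave with the client's login name shows where the account is missing. Creating the same account on all machines, or submitting from an account that exists everywhere, are two possible ways to get a unified user.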
Re: Running a task in Mesos cluster
Posted by Ondrej Smola <on...@gmail.com>.
A Mesos framework receives resource offers and, based on those offers, decides
where to run its tasks.
mesos-execute is a small framework that executes your task (hackbench); see
https://github.com/apache/mesos/blob/master/src/cli/execute.cpp
At https://github.com/apache/mesos/blob/master/src/cli/execute.cpp#L320 you
can see that it uses the user who runs the mesos-execute command.
The error you see should come from here (the su command):
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/3rdparty/stout/include/stout/posix/os.hpp#L520
Under which user do you run mesos-execute and the Mesos daemons?
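The offer model described above can be illustrated with a simplified sketch. It is not the code in execute.cpp: the offer dictionaries, `parse_resources` and `first_fitting_offer` are hypothetical, and the slave names and sizes are invented. It only shows the general idea that each offer comes from a single slave and a framework launches a task on an offer that is large enough, which is why one task ends up on one node.

```python
def parse_resources(spec):
    """Parse a string like "cpus(*):3;mem(*):2560" into {"cpus": 3.0, "mem": 2560.0}."""
    wanted = {}
    for part in spec.split(";"):
        name, _, value = part.partition(":")
        # Drop the role suffix such as "(*)" and keep only the resource name.
        wanted[name.split("(")[0].strip()] = float(value)
    return wanted


def first_fitting_offer(offers, wanted):
    """Return the first offer whose resources cover everything in `wanted`, else None."""
    for offer in offers:
        if all(offer["resources"].get(k, 0.0) >= v for k, v in wanted.items()):
            return offer
    return None


# Invented offers, one per slave.
offers = [
    {"slave": "slave1", "resources": {"cpus": 2.0, "mem": 4096.0}},
    {"slave": "slave2", "resources": {"cpus": 8.0, "mem": 14930.0}},
]
wanted = parse_resources("cpus(*):3;mem(*):2560")
chosen = first_fitting_offer(offers, wanted)
print(chosen["slave"] if chosen else "no offer fits; task stays queued")  # prints slave2
```

In this model a task that asks for 3 CPUs never lands on slave1, and if no single offer is large enough the task simply waits. Spreading work across slaves requires submitting several tasks, or a framework that places tasks on several offers.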
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
Sorry, I cannot get much information from this log message. I see that you are
using balloon_framework; can you try mesos-execute?
Can you please add the option GLOG_v=1 when starting the master and append the
whole log since the master started?
Thanks,
Guangya
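When reading a long master log, it can help to pull out every process address that was registered on a loopback IP, because both the slaves and the framework scheduler in this thread appear as `...@127.0.1.1:port`. The sketch below is a plain regular-expression scan with a hypothetical helper name (`loopback_registrations`). It assumes addresses are written as `name@a.b.c.d:port`, as in the pasted log lines, and it is not a full glog parser.

```python
import re

# Matches the "name@IP:port" process address that appears in master log lines.
PID_RE = re.compile(r"(\S+)@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def loopback_registrations(lines):
    """Return (name, ip, port) for every address in 127.0.0.0/8 found in the lines."""
    hits = []
    for line in lines:
        for name, ip, port in PID_RE.findall(line):
            if ip.startswith("127."):
                hits.append((name, ip, int(port)))
    return hits


# Two lines taken from the logs in this thread, re-joined onto single lines.
sample = [
    "I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave "
    "6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051 "
    "(192.168.0.116) with cpus(*):8; mem(*):14930",
    "I1007 12:16:36.666616 8002 http.cpp:336] HTTP GET for /master/state.json "
    "from 192.168.0.102:40721",
]
for hit in loopback_registrations(sample):
    print(hit)
```

Any hit means the master was told to contact that process on its own loopback interface, which would match the repeated "Failed to shutdown socket" and immediate disconnect entries in the log.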
On Wed, Oct 7, 2015 at 6:17 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Below are the logs from Master.
>
> -Pradeep
>
> 1007 12:16:28.257853 8005 leveldb.cpp:343] Persisting action (20 bytes)
> to leveldb took 119428ns
> I1007 12:16:28.257884 8005 leveldb.cpp:401] Deleting ~2 keys from leveldb
> took 18847ns
> I1007 12:16:28.257891 8005 replica.cpp:679] Persisted action at 1440
> I1007 12:16:28.257912 8005 replica.cpp:664] Replica learned TRUNCATE
> action at position 1440
> I1007 12:16:36.666616 8002 http.cpp:336] HTTP GET for /master/state.json
> from 192.168.0.102:40721 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
> I1007 12:16:39.126030 8001 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.126428 8001 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [ ]
> E1007 12:16:39.127459 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:39.127535 8000 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:39.127734 8001 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:39.127765 8001 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:39.127768 8007 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1007 12:16:39.127789 8001 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.127879 8006 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:39.127913 8001 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:39.129273 8005 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.129312 8005 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:39.129858 8003 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
> I1007 12:16:40.676519 8000 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.676678 8000 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [ ]
> I1007 12:16:40.677178 8006 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> E1007 12:16:40.677217 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:40.677409 8000 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:40.677441 8000 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.677453 8000 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:40.677459 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:40.677501 8000 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:40.677520 8005 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> I1007 12:16:40.678864 8004 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.678906 8004 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:40.679147 8001 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
> I1007 12:16:41.853121 8002 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.853281 8002 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [ ]
> E1007 12:16:41.853806 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:41.853833 8004 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:41.854032 8002 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:41.854063 8002 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.854076 8002 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:41.854080 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:41.854126 8005 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:41.854121 8002 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:41.855482 8006 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.855515 8006 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:41.855692 8001 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
> I1007 12:16:42.772830 8000 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.772974 8000 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [ ]
> I1007 12:16:42.773470 8004 hierarchical.hpp:515] Added framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> E1007 12:16:42.773495 8007 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 12:16:42.773679 8000 master.cpp:1119] Framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> disconnected
> I1007 12:16:42.773697 8000 master.cpp:2475] Disconnecting framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.773708 8000 master.cpp:2499] Deactivating framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> E1007 12:16:42.773710 8007 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1007 12:16:42.773761 8000 master.cpp:1143] Giving framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
> failover
> I1007 12:16:42.773779 8001 hierarchical.hpp:599] Deactivated framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> I1007 12:16:42.775089 8005 master.cpp:4815] Framework failover timeout,
> removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon
> Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.775126 8005 master.cpp:5571] Removing framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
> scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
> I1007 12:16:42.775324 8005 hierarchical.hpp:552] Removed framework
> 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
> I1007 12:16:47.665941 8001 http.cpp:336] HTTP GET for /master/state.json
> from 192.168.0.102:40722 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
>
>
> On 7 October 2015 at 12:12, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> Can you please append more of the log from your master node? I just want to
>> see what is wrong with your master and why the framework starts to fail over.
>>
>> Thanks,
>>
>> Guangya
>>
>> On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> I am running a framework from another physical node, which is part of
>>> the same network. I am still getting the messages below and the framework
>>> is not getting registered.
>>>
>>> Any idea what the reason is?
>>>
>>> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
>>> removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
>>> Framework (C++)) at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
>>> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
>>> with fd 13: Transport endpoint is not connected
>>> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
>>> framework 'Balloon Framework (C++)' at
>>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>>> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework
>>> Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
>>> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
>>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
>>> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
>>> with fd 13: Transport endpoint is not connected
>>>
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>> On 5 October 2015 at 13:51, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> I think the problem might be caused by running the lxc container on
>>>> the master node. I am not sure whether there is a port conflict or
>>>> something else wrong.
>>>>
>>>> In my case, I was running the client on a separate node, not on the
>>>> master node. Perhaps you can try putting your client on a separate node
>>>> as well.
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>>
>>>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> Hmm! That is strange in my case!
>>>>>
>>>>> If I run mesos-execute on one of the slave/master nodes, then
>>>>> the tasks get their resources and are scheduled fine.
>>>>> But if I start mesos-execute on another node that is neither a
>>>>> slave nor the master, then I have this issue.
>>>>>
>>>>> I am using an lxc container on the master as a client to launch the tasks.
>>>>> It is also in the same network as the master and slaves.
>>>>> I launch the task just as you did, but the tasks are not getting
>>>>> scheduled.
>>>>>
>>>>>
>>>>> On the master, the logs are the same as I sent you before:
>>>>>
>>>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>>
>>>>> On both of the slaves I can see the logs below:
>>>>>
>>>>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>>>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%.
>>>>> Max allowed age: 6.047984349521910days
>>>>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>>>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>>>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>>>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>>>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>>>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by
>>>>> master@192.168.0.102:5050
>>>>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>>
>>>>>
>>>>>
>>>>> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> From your log, it seems that the master process is exiting, which
>>>>>> causes the framework to fail over to another Mesos master. Can you please
>>>>>> share more detail on the steps to reproduce your issue?
>>>>>>
>>>>>> I ran a test with mesos-execute on a client host that does not run
>>>>>> any Mesos service, and the task was scheduled fine.
>>>>>>
>>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>>>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>>>>>> master@192.168.0.107:5050
>>>>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>>>>>> Attempting to register without authentication
>>>>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>>>>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>>> task cluster-test submitted to slave
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>> Received status update TASK_FINISHED for task cluster-test
>>>>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>>>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>>>>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto
>>>>>> mesos
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> I am facing one more issue. If I try to schedule the tasks from an
>>>>>>> external client system running the same CLI, mesos-execute, the tasks
>>>>>>> are not launched. The requests reach the Master and it just drops
>>>>>>> them. Below are the related logs:
>>>>>>>
>>>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>>>> with checkpointing disabled and capabilities [ ]
>>>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> disconnected
>>>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns
>>>>>>> to failover
>>>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated
>>>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning
>>>>>>> resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> because the framework has terminated or is inactive
>>>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered
>>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total:
>>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated:
>>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered
>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total:
>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated:
>>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>>
>>>>>>>
>>>>>>> Can you please tell me what the reason is? The client is in the same
>>>>>>> network as well, but it does not run any master or slave processes.
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> Glad it finally works! I am not sure whether you are using systemd.slice
>>>>>>>> or not. Are you running into this issue:
>>>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>>>>
>>>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guangya,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for sharing the information.
>>>>>>>>>
>>>>>>>>> Now I can launch the tasks. The problem was with permissions.
>>>>>>>>> If I start all the slaves and the Master as root, it works fine.
>>>>>>>>> Otherwise I have problems launching the tasks.
>>>>>>>>>
>>>>>>>>> But on one of the nodes I could not launch the slave as root. I am
>>>>>>>>> facing the following issue.
>>>>>>>>>
>>>>>>>>> Failed to create a containerizer: Could not create
>>>>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>>>>
>>>>>>>>> I took that out from the cluster for now. The tasks are getting
>>>>>>>>> scheduled on the other two slave nodes.
>>>>>>>>>
>>>>>>>>> Thanks for your timely help
>>>>>>>>>
>>>>>>>>> -Pradeep
>>>>>>>>>
>>>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> My steps were pretty simple, just as in
>>>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>>>
>>>>>>>>>> On the master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>>>> On the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build#
>>>>>>>>>> GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>>>
>>>>>>>>>> Then schedule a task from any of the nodes. Here I was using slave
>>>>>>>>>> node mesos007, and you can see that the two tasks were launched on
>>>>>>>>>> different hosts.
>>>>>>>>>>
>>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials
>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered
>>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>>> Framework registered with
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>> ^C
>>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials
>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered
>>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>>> Framework registered with
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Guangya
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>>
>>>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>>>
>>>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>>>
>>>>>>>>>>> I must be missing something here, because all my slaves have enough
>>>>>>>>>>> memory and CPUs to launch the tasks I mentioned.
>>>>>>>>>>> I suspect I am missing some configuration steps.
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>
>>>>>>>>>>>> I ran some tests with your case and found that the task can run
>>>>>>>>>>>> on any of the three slave hosts at random; each run may give a different result.
>>>>>>>>>>>> The logic is here:
>>>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>>>> The allocator randomly shuffles the slaves every time it
>>>>>>>>>>>> allocates resources for offers.
>>>>>>>>>>>>
>>>>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>>>>> --resources="cpus(*):3;mem(*):2560". Can you check whether all
>>>>>>>>>>>> of your slaves have enough resources? If you want your task to run on other
>>>>>>>>>>>> slaves, those slaves need to have at least 3 CPUs and 2560 MB of memory.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your reply
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did solve that issue. Yes, you are right, there was an issue
>>>>>>>>>>>>> with the slave IP address setting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I
>>>>>>>>>>>>> try to schedule a task using
>>>>>>>>>>>>>
>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>
>>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I start the Mesos slaves like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>>>> --hostname=slave1
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I submit the task using the above (mesos-execute) command
>>>>>>>>>>>>> from one of the slaves, it runs on that system.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>>>>> just that system and queues the tasks instead of running them on the other slaves.
>>>>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do I need to start some process to push the tasks to all the
>>>>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <
>>>>>>>>>>>>> ondrej.smola@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the problem is with the IP your slave advertises. By default,
>>>>>>>>>>>>>> Mesos resolves your hostname. There are several solutions (let's say your
>>>>>>>>>>>>>> node IP is 192.168.56.128):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>>>> 2) set the mesos options ip and hostname
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> one way to do this is to create the files
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Most likely because of this issue, the slave nodes keep
>>>>>>>>>>>>>>> getting registered and de-registered to make room for the next node. I
>>>>>>>>>>>>>>> can even see this in the web UI: for some time one node is added, and after
>>>>>>>>>>>>>>> some time it is replaced by a new slave node.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The above log is followed by the log messages below.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>>> action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted
>>>>>>>>>>>>>>> action at 384
>>>>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with
>>>>>>>>>>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated:
>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating
>>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica
>>>>>>>>>>>>>>> received learned notice for position 384
>>>>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>>> action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2
>>>>>>>>>>>>>>> keys from leveldb took 20333ns
>>>>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted
>>>>>>>>>>>>>>> action at 384
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One slave runs on the Master node itself and the other slaves
>>>>>>>>>>>>>>>>> run on different nodes. Here, a node means a physical box.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried running tasks on a one-node cluster first and
>>>>>>>>>>>>>>>>> tested task scheduling using mesos-execute; it works fine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves)
>>>>>>>>>>>>>>>>> and look at the resources on the master (in the GUI), only the Master node
>>>>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>>>>> The other nodes' resources are not visible. Sometimes they are
>>>>>>>>>>>>>>>>> visible, but in a deactivated state.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>>>> mesos-master? There should be some logs in either the master or the slave telling
>>>>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the
>>>>>>>>>>>>>>>>> nodes are in the same network. *
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>>>>> your cluster, which is why the other nodes are not available. We first need
>>>>>>>>>>>>>>>> to identify what is wrong with the other nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Is it required to register the frameworks from every
>>>>>>>>>>>>>>>>> slave node on the Master?*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *I have configured this cluster using the code from GitHub.*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
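A note on the logs quoted above: the master records the framework at
scheduler-...@127.0.1.1:58843, which is a loopback address, even though the
client runs on a different machine. One plausible reading, in line with the
LIBPROCESS_IP advice given earlier in the thread for the slave, is that the
client's hostname resolves to 127.0.1.1, so the master cannot connect back to
the scheduler. The Python sketch below only illustrates how one might check
this. The helper names (address_from_pid, is_loopback,
hostname_resolves_to_loopback) are made up for this example and are not part
of Mesos.

```python
import ipaddress
import socket


def address_from_pid(pid: str) -> str:
    """Return the IP part of a 'name@ip:port' string as printed in the logs.

    Assumes the IPv4 'ip:port' form seen in this thread.
    """
    host_port = pid.rsplit("@", 1)[-1]
    return host_port.rsplit(":", 1)[0]


def is_loopback(address: str) -> bool:
    """True if the address lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback() -> bool:
    """Check whether this machine's own hostname resolves to a loopback address.

    Hypothetical helper: run it on the client that starts mesos-execute.
    """
    return is_loopback(socket.gethostbyname(socket.gethostname()))
```

If hostname_resolves_to_loopback() returns True on the client, exporting
LIBPROCESS_IP with the client's routable address before starting the
framework is worth trying.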
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Below are the logs from Master.
-Pradeep
I1007 12:16:28.257853 8005 leveldb.cpp:343] Persisting action (20 bytes) to
leveldb took 119428ns
I1007 12:16:28.257884 8005 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 18847ns
I1007 12:16:28.257891 8005 replica.cpp:679] Persisted action at 1440
I1007 12:16:28.257912 8005 replica.cpp:664] Replica learned TRUNCATE
action at position 1440
I1007 12:16:36.666616 8002 http.cpp:336] HTTP GET for /master/state.json
from 192.168.0.102:40721 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
I1007 12:16:39.126030 8001 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.126428 8001 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [ ]
E1007 12:16:39.127459 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:39.127535 8000 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127734 8001 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:39.127765 8001 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:39.127768 8007 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1007 12:16:39.127789 8001 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.127879 8006 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:39.127913 8001 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:39.129273 8005 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129312 8005 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:39.129858 8003 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0000
I1007 12:16:40.676519 8000 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.676678 8000 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [ ]
I1007 12:16:40.677178 8006 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
E1007 12:16:40.677217 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:40.677409 8000 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:40.677441 8000 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.677453 8000 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:40.677459 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:40.677501 8000 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:40.677520 8005 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:40.678864 8004 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.678906 8004 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:40.679147 8001 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0001
I1007 12:16:41.853121 8002 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.853281 8002 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [ ]
E1007 12:16:41.853806 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:41.853833 8004 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854032 8002 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:41.854063 8002 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.854076 8002 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:41.854080 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:41.854126 8005 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:41.854121 8002 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:41.855482 8006 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855515 8006 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:41.855692 8001 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0002
I1007 12:16:42.772830 8000 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.772974 8000 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [ ]
I1007 12:16:42.773470 8004 hierarchical.hpp:515] Added framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
E1007 12:16:42.773495 8007 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 12:16:42.773679 8000 master.cpp:1119] Framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 disconnected
I1007 12:16:42.773697 8000 master.cpp:2475] Disconnecting framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.773708 8000 master.cpp:2499] Deactivating framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
E1007 12:16:42.773710 8007 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1007 12:16:42.773761 8000 master.cpp:1143] Giving framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843 0ns to
failover
I1007 12:16:42.773779 8001 hierarchical.hpp:599] Deactivated framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:42.775089 8005 master.cpp:4815] Framework failover timeout,
removing framework 0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon
Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775126 8005 master.cpp:5571] Removing framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003 (Balloon Framework (C++)) at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
I1007 12:16:42.775324 8005 hierarchical.hpp:552] Removed framework
0ccab17d-20e8-4ab8-9de4-ae60691f8c8e-0003
I1007 12:16:47.665941 8001 http.cpp:336] HTTP GET for /master/state.json
from 192.168.0.102:40722 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.52 Safari/537.36'
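A note on the master log above: every SUBSCRIBE comes from a scheduler advertising 127.0.1.1, a loopback address, and the master disconnects the framework immediately afterwards. One possible reading, in line with the LIBPROCESS_IP advice quoted further down in this thread, is that the master cannot connect back to the client at that address. The Python sketch below is only an illustration for spotting this pattern in a log; the function name loopback_schedulers and the regular expression are assumptions made for this example, not part of Mesos.

```python
import ipaddress
import re

# Matches the address a scheduler advertises in master log lines,
# e.g. "scheduler-<uuid>@127.0.1.1:58843". The pattern is an assumption
# based on the log excerpt in this thread.
SCHEDULER_ADDR = re.compile(
    r"scheduler-[0-9a-f-]+@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def loopback_schedulers(log_text):
    """Return the set of (ip, port) pairs advertised by schedulers
    whose IP is a loopback address (127.0.0.0/8)."""
    found = set()
    for ip, port in SCHEDULER_ADDR.findall(log_text):
        try:
            if ipaddress.ip_address(ip).is_loopback:
                found.add((ip, int(port)))
        except ValueError:
            # Not a valid IPv4 address (e.g. an octet above 255); skip it.
            continue
    return found


sample = """\
I1007 12:16:40.676519 8000 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-a8deafaa-cf10-401c-a61c-515340560c49@127.0.1.1:58843
"""

print(loopback_schedulers(sample))
```

If the set is non-empty, the client machine is probably advertising an address that other hosts cannot reach.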
On 7 October 2015 at 12:12, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> Can you please attach more of the log from your master node? I just want to
> see what is wrong with your master and why the framework starts to fail over.
>
> Thanks,
>
> Guangya
>
> On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> I am running a framework from another physical node, which is part of
>> the same network. I am still getting the messages below, and the framework
>> is not getting registered.
>>
>> Any idea what is the reason?
>>
>> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
>> removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
>> Framework (C++)) at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
>> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
>> with fd 13: Transport endpoint is not connected
>> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
>> framework 'Balloon Framework (C++)' at
>> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
>> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework
>> Balloon Framework (C++) with checkpointing disabled and capabilities [ ]
>> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
>> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
>> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
>> with fd 13: Transport endpoint is not connected
>>
>>
>> Regards,
>> Pradeep
>>
>>
>> On 5 October 2015 at 13:51, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> I think the problem might be caused by running the lxc container on the
>>> master node; I am not sure whether there is a port conflict or something
>>> else wrong.
>>>
>>> In my case, I was running the client on a separate node, not on the master
>>> node. Perhaps you can try putting your client on a separate node rather
>>> than on the master node.
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>>
>>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Hmm!...That is strange in my case!
>>>>
>>>> If I run mesos-execute on one of the slave/master nodes, then
>>>> the tasks get their resources and are scheduled fine.
>>>> But if I start mesos-execute on another node which is neither a
>>>> slave nor the master, then I have this issue.
>>>>
>>>> I am using an lxc container on master as a client to launch the tasks.
>>>> This is also in the same network as master/slaves.
>>>> And I just launch the task as you did. But the tasks are not getting
>>>> scheduled.
>>>>
>>>>
>>>> On the master the logs are the same as I sent you before
>>>>
>>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>> On both of the slaves I can see the below logs
>>>>
>>>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%.
>>>> Max allowed age: 6.047984349521910days
>>>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066 by
>>>> master@192.168.0.102:5050
>>>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>>
>>>>
>>>>
>>>> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> From your log, it seems that the master process is exiting, and this
>>>>> caused the framework to fail over to another mesos master. Can you please
>>>>> give more detail on the steps to reproduce your issue?
>>>>>
>>>>> I did some tests by running mesos-execute on a client host which does
>>>>> not run any mesos service, and the task was scheduled fine.
>>>>>
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>>>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>>>>> master@192.168.0.107:5050
>>>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>>>>> Attempting to register without authentication
>>>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>>>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>>> task cluster-test submitted to slave
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>> Received status update TASK_FINISHED for task cluster-test
>>>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>>>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>>
>>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>> I am facing one more issue. If I try to schedule the tasks from an
>>>>>> external client system running the same mesos-execute CLI,
>>>>>> the tasks are not launched. The requests reach the Master and it
>>>>>> just drops them; below are the related logs.
>>>>>>
>>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>>> with checkpointing disabled and capabilities [ ]
>>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> disconnected
>>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 14: Transport endpoint is not connected
>>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns
>>>>>> to failover
>>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated
>>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning
>>>>>> resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> because the framework has terminated or is inactive
>>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered
>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total:
>>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated:
>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered
>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total:
>>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated:
>>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>>
>>>>>>
>>>>>> Can you please tell me what is the reason? The client is in the same
>>>>>> network as well. But it does not run any master or slave processes.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
>>>>>>
>>>>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> Glad it finally works! I am not sure whether you are using systemd.slice;
>>>>>>> are you running into this issue:
>>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>>>
>>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for sharing the information.
>>>>>>>>
>>>>>>>> Now I can launch the tasks. The problem was with permissions.
>>>>>>>> If I start all the slaves and the Master as root, it works fine.
>>>>>>>> Otherwise I have problems launching the tasks.
>>>>>>>>
>>>>>>>> But on one of the slave nodes I could not launch the slave as root; I am
>>>>>>>> facing the following issue.
>>>>>>>>
>>>>>>>> Failed to create a containerizer: Could not create
>>>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>>>
>>>>>>>> I took that out from the cluster for now. The tasks are getting
>>>>>>>> scheduled on the other two slave nodes.
>>>>>>>>
>>>>>>>> Thanks for your timely help
>>>>>>>>
>>>>>>>> -Pradeep
>>>>>>>>
>>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> My steps were pretty simple, just as in
>>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>>
>>>>>>>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build#
>>>>>>>>> GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>>
>>>>>>>>> Then schedule a task on any of the nodes. Here I was using slave
>>>>>>>>> node mesos007; you can see that the two tasks were launched on different
>>>>>>>>> hosts.
>>>>>>>>>
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials
>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered
>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>> ^C
>>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>>>>> master@192.168.0.107:5050
>>>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials
>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered
>>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>>> task cluster-test submitted to slave
>>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guangya,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply.
>>>>>>>>>>
>>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>>
>>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>>
>>>>>>>>>> I am missing something here; all my slaves have enough
>>>>>>>>>> memory and cpus to launch the tasks I mentioned.
>>>>>>>>>> What I am missing must be some configuration step.
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> I did some tests with your case and found that the task can run
>>>>>>>>>>> on any of the three slave hosts at random; each run may give a different result.
>>>>>>>>>>> The logic is here:
>>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>>> The allocator randomly shuffles the slaves every time
>>>>>>>>>>> it allocates resources for offers.
>>>>>>>>>>>
>>>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>>>>> your slaves have enough resources? If you want your task to run on other
>>>>>>>>>>> slaves, then those slaves need to have at least 3 cpus and 2560 MB of memory.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your reply
>>>>>>>>>>>>
>>>>>>>>>>>> I did solve that issue; yes, you are right, there was an issue
>>>>>>>>>>>> with the slave IP address setting.
>>>>>>>>>>>>
>>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>>>> to schedule a task using
>>>>>>>>>>>>
>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>
>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>
>>>>>>>>>>>> I just start the mesos slaves like below
>>>>>>>>>>>>
>>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>>> --hostname=slave1
>>>>>>>>>>>>
>>>>>>>>>>>> If I submit the task using the above (mesos-execute) command
>>>>>>>>>>>> from one of the slaves, it runs on that system.
>>>>>>>>>>>>
>>>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>>>> just that system and queues the tasks; it does not run them on the other slaves.
>>>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user"
>>>>>>>>>>>>
>>>>>>>>>>>> Do I need to start some process to push the tasks to all the
>>>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <
>>>>>>>>>>>> ondrej.smola@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the problem is with the IP your slave advertises - mesos by default
>>>>>>>>>>>>> resolves your hostname - there are several solutions (let's say your node ip
>>>>>>>>>>>>> is 192.168.56.128)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> one way to do this is to create files
>>>>>>>>>>>>>
>>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>>
>>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Most likely because of this issue, the slave nodes keep
>>>>>>>>>>>>>> getting registered and de-registered to make room for the next node. I
>>>>>>>>>>>>>> can even see this in
>>>>>>>>>>>>>> the web UI: for some time one node is added, and after
>>>>>>>>>>>>>> some time it is replaced by a new slave node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>> action (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action
>>>>>>>>>>>>>> at 384
>>>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating
>>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting
>>>>>>>>>>>>>> action (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action
>>>>>>>>>>>>>> at 384
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves
>>>>>>>>>>>>>>>> run on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves)
>>>>>>>>>>>>>>>> and try to see the resources on the master (in the GUI), only the Master node
>>>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>>>> The other nodes' resources are not visible. Sometimes they are
>>>>>>>>>>>>>>>> visible but in a deactivated state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>>>> your cluster; that's why the other nodes are not available. We first need to
>>>>>>>>>>>>>>> identify what is wrong with the other three nodes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Is it required to register the frameworks from every
>>>>>>>>>>>>>>>> slave node on the Master?*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
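The LIBPROCESS_IP suggestion quoted above applies to whichever machine advertises the wrong address, which may include the client running mesos-execute. A quick way to see what a hostname lookup returns on a given machine is sketched below in Python. This is an assumption-laden check: Python's resolver may not match what libprocess does exactly, and the helper names is_loopback and advertised_ip are made up for this example.

```python
import ipaddress
import socket


def is_loopback(addr):
    """True if addr (a string such as "127.0.1.1") is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def advertised_ip():
    """Resolve the local hostname to an IPv4 address.

    Returns None if the hostname cannot be resolved. This only approximates
    the default lookup; it is not the Mesos implementation.
    """
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return None


if __name__ == "__main__":
    ip = advertised_ip()
    if ip is None:
        print("hostname does not resolve")
    elif is_loopback(ip):
        print(f"{ip} is loopback; other hosts cannot connect back to it")
    else:
        print(f"{ip} looks routable")
```

If the lookup returns a loopback address such as 127.0.1.1, exporting LIBPROCESS_IP with the machine's routable address before starting mesos-execute, as in option 1) above, may be worth trying.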
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
Can you please attach more of the log from your master node? I just want to
see what is wrong with your master and why the framework starts to fail over.
Thanks,
Guangya
On Wed, Oct 7, 2015 at 5:27 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Guangya,
>
> I am running a framework from another physical node, which is part of
> the same network. I am still getting the messages below, and the framework
> is not getting registered.
>
> Any idea what is the reason?
>
> I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
> removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
> Framework (C++)) at
> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
> E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
> I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
> framework 'Balloon Framework (C++)' at
> scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
> I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework Balloon
> Framework (C++) with checkpointing disabled and capabilities [ ]
> I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
> 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
> E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
> with fd 13: Transport endpoint is not connected
>
>
> Regards,
> Pradeep
>
>
> On 5 October 2015 at 13:51, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> I think the problem might be caused by running the lxc container on the
>> master node; I am not sure whether there is a port conflict or something
>> else wrong.
>>
>> In my case, I was running the client on a separate node, not on the master
>> node. Perhaps you can try putting your client on a separate node instead of
>> the master node.
>>
>> Thanks,
>>
>> Guangya
>>
>>
>> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> Hmm!...That is strange in my case!
>>>
>>> If I run mesos-execute on one of the slave/master nodes, then the
>>> tasks get their resources and are scheduled fine.
>>> But if I start mesos-execute on another node that is neither a
>>> slave nor the master, then I have this issue.
>>>
>>> I am using an lxc container on the master as a client to launch the tasks.
>>> It is also on the same network as the master/slaves.
>>> I launch the task just as you did, but the tasks are not getting
>>> scheduled.
>>>
>>>
>>> On master the logs are same as I sent you before
>>>
>>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>
>>> On both of the slaves I can see the below logs
>>>
>>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
>>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%.
>>> Max allowed age: 6.047984349521910days
>>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
>>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
>>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
>>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
>>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
>>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
>>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>>
>>>
>>>
>>> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> From your log, it seems that the master process is exiting, and this
>>>> caused the framework to fail over to another Mesos master. Can you please
>>>> give more detail on the steps to reproduce your issue?
>>>>
>>>> I ran a test with mesos-execute on a client host that does not run
>>>> any Mesos service, and the task was scheduled fine.
>>>>
>>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10"
>>>> --resources="cpus(*):1;mem(*):256"
>>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>>>> master@192.168.0.107:5050
>>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>>>> Attempting to register without authentication
>>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>>> task cluster-test submitted to slave
>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>>> Received status update TASK_RUNNING for task cluster-test
>>>> Received status update TASK_FINISHED for task cluster-test
>>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>>
>>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> I am facing one more issue. If I try to schedule tasks from an
>>>>> external client system running the same mesos-execute CLI,
>>>>> the tasks are not launched. The requests reach the Master, and it
>>>>> just drops them. Below are the related logs:
>>>>>
>>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>>> with checkpointing disabled and capabilities [ ]
>>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 14: Transport endpoint is not connected
>>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> disconnected
>>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 14: Transport endpoint is not connected
>>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
>>>>> failover
>>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated
>>>>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning
>>>>> resources offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> because the framework has terminated or is inactive
>>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered
>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total:
>>>>> cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated:
>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered
>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total:
>>>>> cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated:
>>>>> ) on slave 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>>
>>>>>
>>>>> Can you please tell me what the reason is? The client is on the same
>>>>> network as well, but it does not run any master or slave processes.
>>>>>
>>>>> Thanks & Regards,
>>>>> Pradeep
>>>>>
>>>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> Glad it finally works! I am not sure whether you are using
>>>>>> systemd.slice, but are you running into this issue:
>>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>>
>>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>>
>>>>>>> Thanks for sharing the information.
>>>>>>>
>>>>>>> Now I can launch the tasks. The problem was with permissions.
>>>>>>> If I start all the slaves and the Master as root, it works fine.
>>>>>>> Otherwise I have problems launching the tasks.
>>>>>>>
>>>>>>> But on one of the slaves I could not launch the slave as root; I am
>>>>>>> facing the following issue.
>>>>>>>
>>>>>>> Failed to create a containerizer: Could not create
>>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>>
>>>>>>> I took that out from the cluster for now. The tasks are getting
>>>>>>> scheduled on the other two slave nodes.
>>>>>>>
>>>>>>> Thanks for your timely help
>>>>>>>
>>>>>>> -Pradeep
>>>>>>>
>>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> My steps were pretty simple, just as in
>>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>>
>>>>>>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build#
>>>>>>>> GLOG_v=1 ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>>
>>>>>>>> Then schedule a task from any of the nodes. Here I was using slave
>>>>>>>> node mesos007; you can see that the two tasks were launched on
>>>>>>>> different hosts.
>>>>>>>>
>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>>>> master@192.168.0.107:5050
>>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>>>>>> Attempting to register without authentication
>>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered
>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>>> task cluster-test submitted to slave
>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>> ^C
>>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>>>> master@192.168.0.107:5050
>>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>>>>>> Attempting to register without authentication
>>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered
>>>>>>>> with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>>> task cluster-test submitted to slave
>>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guangya,
>>>>>>>>>
>>>>>>>>> Thanks for your reply.
>>>>>>>>>
>>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>>
>>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>>
>>>>>>>>> I am missing something here; all my slaves have enough
>>>>>>>>> memory and cpus to launch the tasks I mentioned.
>>>>>>>>> What I am missing is some configuration step.
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Pradeep
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> I ran some tests with your case and found that the task can run on
>>>>>>>>>> any of the three slave hosts at random; each run may give a different result.
>>>>>>>>>> The logic is here:
>>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>>> The allocator randomly shuffles the slaves each time it
>>>>>>>>>> allocates resources for offers.
>>>>>>>>>>
>>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>>>> your slaves have enough resources? If you want your task to run on other
>>>>>>>>>> slaves, then those slaves need to have at least 3 cpus and 2560 MB of memory.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply
>>>>>>>>>>>
>>>>>>>>>>> I did solve that issue. Yes, you are right, there was an issue
>>>>>>>>>>> with the slave IP address setting.
>>>>>>>>>>>
>>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>>> to schedule a task using
>>>>>>>>>>>
>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>
>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>
>>>>>>>>>>> I just start the mesos slaves like below
>>>>>>>>>>>
>>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>>> --hostname=slave1
>>>>>>>>>>>
>>>>>>>>>>> If I submit the task using the above (mesos-execute) command
>>>>>>>>>>> from one of the slaves, it runs on that system.
>>>>>>>>>>>
>>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>>> just that system and queues the tasks; it does not run them on the other slaves.
>>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>>
>>>>>>>>>>> Do I need to start some process to push the tasks to all the
>>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <ondrej.smola@gmail.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>
>>>>>>>>>>>> the problem is with the IP your slave advertises - mesos by default
>>>>>>>>>>>> resolves your hostname - there are several solutions (let's say your node ip
>>>>>>>>>>>> is 192.168.56.128)
>>>>>>>>>>>>
>>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>>>>>>
>>>>>>>>>>>> one way to do this is to create files
>>>>>>>>>>>>
>>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>>
>>>>>>>>>>>> for more configuration options see
>>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most likely because of this issue, the slave nodes keep
>>>>>>>>>>>>> getting registered and de-registered to make room for the next node. I
>>>>>>>>>>>>> can even see this in
>>>>>>>>>>>>> the UI: for some time one node is added, and after
>>>>>>>>>>>>> some time it is replaced with a new slave node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action
>>>>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action
>>>>>>>>>>>>> at 384
>>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to
>>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating
>>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action
>>>>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action
>>>>>>>>>>>>> at 384
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves
>>>>>>>>>>>>>>> run on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves)
>>>>>>>>>>>>>>> and try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>>> The other nodes' resources are not visible. Sometimes they are
>>>>>>>>>>>>>>> visible but in a deactivated state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>>> your cluster; that's why the other nodes are not available. We first need
>>>>>>>>>>>>>> to identify what is wrong with the other three nodes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Is it required to register the frameworks from every slave
>>>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
I am running a framework from another physical node, which is part of
the same network. I am still getting the messages below, and the framework
is not getting registered.
Any idea what the reason is?
I1007 11:24:58.781914 32392 master.cpp:4815] Framework failover timeout,
removing framework 89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon
Framework (C++)) at
scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
I1007 11:24:58.781968 32392 master.cpp:5571] Removing framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019 (Balloon Framework (C++)) at
scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
I1007 11:24:58.782352 32392 hierarchical.hpp:552] Removed framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0019
E1007 11:24:58.782577 32399 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
I1007 11:24:59.699717 32396 master.cpp:2250] Subscribing framework Balloon
Framework (C++) with checkpointing disabled and capabilities [ ]
I1007 11:24:59.700251 32393 hierarchical.hpp:515] Added framework
89b179d8-9fb7-4a61-ad03-a9a5525482ff-0020
E1007 11:24:59.700253 32399 process.cpp:1912] Failed to shutdown socket
with fd 13: Transport endpoint is not connected
Regards,
Pradeep
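A note on the log above: the scheduler registers from 127.0.1.1, which is a loopback address, so the master may be unable to connect back to it from another machine. The sketch below is only an illustration and not part of Mesos: it scans master log text for scheduler endpoints and reports the loopback ones. The function name loopback_schedulers and the regular expression are assumptions made for this example.

```python
import ipaddress
import re

# Matches the "scheduler-<uuid>@<ip>:<port>" endpoint printed in the master log.
# The pattern is an assumption based on the log lines quoted in this thread.
SCHEDULER_ENDPOINT = re.compile(
    r"scheduler-[0-9a-f-]+@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def loopback_schedulers(log_text):
    """Return sorted (ip, port) pairs of scheduler endpoints on a loopback address."""
    found = set()
    for ip, port in SCHEDULER_ENDPOINT.findall(log_text):
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            # Not a valid IPv4 address (for example 999.1.1.1); skip it.
            continue
        if address.is_loopback:
            found.add((ip, int(port)))
    return sorted(found)


sample = """\
I1007 11:24:59.699587 32396 master.cpp:2179] Received SUBSCRIBE call for
framework 'Balloon Framework (C++)' at
scheduler-3848d80c-8d27-48e0-a6b7-7e1678d5401d@127.0.1.1:54203
"""

print(loopback_schedulers(sample))  # [('127.0.1.1', 54203)]
```

If an endpoint is reported, setting LIBPROCESS_IP on the client to a routable address before starting the framework, as suggested earlier in this thread for the slaves, may be worth trying.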
On 5 October 2015 at 13:51, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> I think the problem might be caused by running the lxc container on the
> master node; I am not sure whether there is a port conflict or something
> else wrong.
>
> In my case, I was running the client on a separate node, not on the master
> node. Perhaps you can try putting your client on a separate node instead of
> the master node.
>
> Thanks,
>
> Guangya
>
>
> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> Hmm!...That is strange in my case!
>>
>> If I run mesos-execute on one of the slave/master nodes, then the
>> tasks get their resources and are scheduled fine.
>> But if I start mesos-execute on another node that is neither a
>> slave nor the master, then I have this issue.
>>
>> I am using an lxc container on the master as a client to launch the tasks.
>> It is also on the same network as the master/slaves.
>> I launch the task just as you did, but the tasks are not getting
>> scheduled.
>>
>>
>> On master the logs are same as I sent you before
>>
>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>> On both of the slaves I can see the below logs
>>
>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%. Max
>> allowed age: 6.047984349521910days
>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>>
>>
>> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> From your log, it seems that the master process is exiting, and this
>>> caused the framework to fail over to another Mesos master. Can you please
>>> give more detail on the steps to reproduce your issue?
>>>
>>> I ran a test with mesos-execute on a client host that does not run
>>> any Mesos service, and the task was scheduled fine.
>>>
>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10"
>>> --resources="cpus(*):1;mem(*):256"
>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>>> master@192.168.0.107:5050
>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>>> Attempting to register without authentication
>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> task cluster-test submitted to slave
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>> Received status update TASK_RUNNING for task cluster-test
>>> Received status update TASK_FINISHED for task cluster-test
>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>>
>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> I am facing one more issue. If I try to schedule tasks from an
>>>> external client system running the same mesos-execute CLI,
>>>> the tasks are not launched. The requests reach the Master, and it
>>>> just drops them. Below are the related logs:
>>>>
>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>> with checkpointing disabled and capabilities [ ]
>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> disconnected
>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
>>>> failover
>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
>>>> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
>>>> framework has terminated or is inactive
>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>
>>>>
>>>> Can you please tell me what the reason is? The client is in the same
>>>> network as well, but it does not run any master or slave processes.
>>>>
>>>> Thanks & Regards,
>>>> Pradeep
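One detail in the master log above may be relevant: the framework registers as scheduler-...@127.0.1.1:47259, a loopback address, which a master running on another host cannot connect back to. A minimal Python sketch for spotting such endpoints in a log follows. The function name and the regular expression are illustrative assumptions based on the log lines in this thread, not part of Mesos.

```python
import ipaddress
import re

# Matches libprocess-style endpoints such as
# "scheduler-<uuid>@127.0.1.1:47259" or "master@192.168.0.107:5050".
# The pattern is an assumption derived from the log lines in this thread.
ENDPOINT = re.compile(r"(\S+)@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def loopback_endpoints(log_text):
    """Return sorted unique (name, ip, port) endpoints whose IP is loopback."""
    found = set()
    for name, ip, port in ENDPOINT.findall(log_text):
        try:
            address = ipaddress.ip_address(ip)
        except ValueError:
            continue  # looked like an IP but is not a valid one
        if address.is_loopback:
            found.add((name, ip, int(port)))
    return sorted(found)


# Two log lines copied from this thread.
sample = (
    "I1005 11:33:35.026298 21369 master.cpp:1119] Framework "
    "77539063-89ce-4efa-a20b-ca788abbd912-0055 () at "
    "scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 "
    "disconnected\n"
    "I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at "
    "master@192.168.0.107:5050\n"
)

for name, ip, port in loopback_endpoints(sample):
    print(f"{name} advertises loopback address {ip}:{port}")
```

If the client shows up here, exporting LIBPROCESS_IP on the client before running mesos-execute, as suggested earlier in the thread for the slaves, may be worth trying. Note that the sketch does not join endpoints that the mail client wrapped across two lines.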
>>>>
>>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Glad it finally works! I am not sure whether you are using
>>>>> systemd.slice, but are you running into this issue:
>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>
>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>>
>>>>>> Thanks for sharing the information.
>>>>>>
>>>>>> Now I can launch the tasks. The problem was with permissions: if
>>>>>> I start all the slaves and the Master as root, it works fine.
>>>>>> Otherwise I have problems launching the tasks.
>>>>>>
>>>>>> But on one of the slaves I could not launch the slave as root; I am
>>>>>> facing the following issue.
>>>>>>
>>>>>> Failed to create a containerizer: Could not create
>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>
>>>>>> I took that slave out of the cluster for now. The tasks are getting
>>>>>> scheduled on the other two slave nodes.
>>>>>>
>>>>>> Thanks for your timely help
>>>>>>
>>>>>> -Pradeep
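The containerizer error above says that the freezer subsystem is already attached to a hierarchy other than the one at /sys/fs/cgroup/freezer. A small Python sketch for finding where it is mounted follows. It assumes cgroup v1 and the usual /proc/mounts layout; the helper name and the sample mount table are made up for illustration.

```python
def cgroup_mounts(mounts_text, subsystem):
    """Return mount points of cgroup (v1) filesystems with `subsystem` attached.

    `mounts_text` is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not a cgroup v1 mount.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        # Attached subsystems appear as plain names in the mount options.
        if subsystem in fields[3].split(","):
            points.append(fields[1])
    return points


# Hypothetical /proc/mounts excerpt, made up for illustration.
sample = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,cpuacct 0 0
"""

print(cgroup_mounts(sample, "freezer"))
```

On the affected slave, the same function can be fed the real table with cgroup_mounts(open("/proc/mounts").read(), "freezer").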
>>>>>>
>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> My steps were pretty simple, just as in
>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>
>>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>> On each of the 3 Slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>
>>>>>>> Then schedule a task from any of the nodes. Here I was using slave node
>>>>>>> mesos007; you can see that the two tasks were launched on different hosts.
>>>>>>>
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>>> master@192.168.0.107:5050
>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>> ^C
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>>> master@192.168.0.107:5050
>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>
>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>
>>>>>>>> I am missing something here, because all my slaves have enough
>>>>>>>> memory and CPUs to launch the tasks I mentioned.
>>>>>>>> I am probably missing some configuration steps.
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> I did some tests with your case and found that the task can run
>>>>>>>>> on any of the three slave hosts at random; each run may give a
>>>>>>>>> different result. The logic is here:
>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>> The allocator randomly shuffles the slaves each time it
>>>>>>>>> allocates resources for offers.
>>>>>>>>>
>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>> --resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>>> your slaves have enough resources? If you want your task to run on
>>>>>>>>> other slaves, those slaves need at least 3 cpus and 2560 MB of memory.
>>>>>>>>>
>>>>>>>>> Thanks
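The two points above (slaves are shuffled, and a slave must cover the requested resources) can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not the Mesos allocator: the function names are made up, only scalar resources are compared, and the slave named "small" is invented. The totals for S1 and S2 are the ones reported in the master logs in this thread.

```python
import random
import re


def parse_scalars(resources):
    """Parse scalar resources from a string like 'cpus(*):3;mem(*):2560'.

    Range resources such as ports(*):[31000-32000] are skipped.
    """
    scalars = {}
    for item in resources.split(";"):
        match = re.fullmatch(r"\s*(\w+)\(\*\):(\d+(?:\.\d+)?)\s*", item)
        if match:
            scalars[match.group(1)] = float(match.group(2))
    return scalars


def fits(request, offered):
    """True if every requested scalar is covered by the offered resources."""
    have = parse_scalars(offered)
    return all(have.get(name, 0.0) >= amount
               for name, amount in parse_scalars(request).items())


def pick_slave(request, slaves, rng=random):
    """Toy model: shuffle the slaves, then return the first one that fits."""
    names = list(slaves)
    rng.shuffle(names)
    for name in names:
        if fits(request, slaves[name]):
            return name
    return None


slaves = {
    "S1": "cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]",
    "S2": "cpus(*):8; mem(*):14868; disk(*):218835; ports(*):[31000-32000]",
    "small": "cpus(*):2; mem(*):2048",  # made up: too small for the request
}
request = "cpus(*):3;mem(*):2560"
print(pick_slave(request, slaves))
```

Running it repeatedly prints S1 or S2 in no fixed order and never "small", which matches the behaviour described: a task can land on any slave that has enough free resources.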
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply
>>>>>>>>>>
>>>>>>>>>> I did solve that issue. Yes, you are right: there was an issue with
>>>>>>>>>> the slave IP address setting.
>>>>>>>>>>
>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>> to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>>> I just start the mesos slaves like below
>>>>>>>>>>
>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>> --hostname=slave1
>>>>>>>>>>
>>>>>>>>>> If I submit the task using the above mesos-execute command from
>>>>>>>>>> one of the slaves, it runs on that system.
>>>>>>>>>>
>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>> just that system and queues the tasks; it does not run them on
>>>>>>>>>> the other slaves.
>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>
>>>>>>>>>> Do I need to start some process to push the tasks to all the
>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> the problem is with the IP your slave advertises. By default, Mesos
>>>>>>>>>>> resolves your hostname. There are several solutions (let's say your
>>>>>>>>>>> node IP is 192.168.56.128):
>>>>>>>>>>>
>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>> 2) set the mesos options ip and hostname
>>>>>>>>>>>
>>>>>>>>>>> One way to do this is to create the files
>>>>>>>>>>>
>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>
>>>>>>>>>>> for more configuration options see
>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
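To see what a node's own hostname resolves to, and so roughly what it may advertise when neither LIBPROCESS_IP nor the ip option is set, a minimal Python sketch follows. It uses only the standard library; the helper names are made up here, and it approximates rather than reproduces what Mesos does internally. The 127.0.1.1 address seen in the master logs in this thread is the kind of result it would flag.

```python
import ipaddress
import socket


def is_loopback(address):
    """True for loopback addresses (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def resolved_local_ip():
    """Resolve this machine's own hostname, or return None if that fails."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return None


ip = resolved_local_ip()
if ip is None:
    print("hostname does not resolve")
elif is_loopback(ip):
    print(f"hostname resolves to loopback address {ip}; "
          "other hosts cannot connect back to it")
else:
    print(f"hostname resolves to {ip}")
```

If the loopback branch is hit on a slave or on the client, setting LIBPROCESS_IP or the ip option to the node's real address, as described above, is the thing to try.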
>>>>>>>>>>>
>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply. I found one interesting log message.
>>>>>>>>>>>>
>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>
>>>>>>>>>>>> Most likely because of this, the slave nodes keep getting
>>>>>>>>>>>> registered and de-registered to make room for the next node. I
>>>>>>>>>>>> can even see this in the web UI: one node is added for a while
>>>>>>>>>>>> and is then replaced by a new slave node.
>>>>>>>>>>>>
>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One slave runs on the Master node itself and the other slaves run
>>>>>>>>>>>>>> on different nodes. Here "node" means a physical box.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried running tasks on a one-node cluster and tested
>>>>>>>>>>>>>> task scheduling using mesos-execute; that works fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I configure a three-node cluster (1 master and 3 slaves)
>>>>>>>>>>>>>> and look at the resources on the master (in the GUI), only the Master
>>>>>>>>>>>>>> node's resources are visible.
>>>>>>>>>>>>>> The other nodes' resources are not visible. Sometimes they are
>>>>>>>>>>>>>> visible but in a deactivated state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>> your cluster; that is why the other nodes are not available. We first
>>>>>>>>>>>>> need to identify what is wrong with the other nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Is it required to register the frameworks from every slave
>>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *I have configured this cluster using the code from GitHub.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
Thanks for the reply.
I think the same. I found an old e-mail thread in which the same thing
was discussed.
The poster set up a client on a separate physical system, and then it
started working fine.
I will try that as well.
Regards,
Pradeep
On 5 October 2015 at 13:51, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> I think the problem might be caused by running the client in an lxc
> container on the master node; I am not sure whether there is a port
> conflict or something else wrong.
>
> In my case, I was running the client on a new node rather than on the
> master node. Perhaps you can try putting your client on a new node
> instead of the master node.
>
> Thanks,
>
> Guangya
>
>
> On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> Hmm, that is strange in my case!
>>
>> If I run mesos-execute on one of the slave/master nodes, then the
>> tasks get their resources and are scheduled fine.
>> But if I start mesos-execute on another node which is neither a
>> slave nor the master, then I have this issue.
>>
>> I am using an lxc container on the master as a client to launch the tasks.
>> It is also in the same network as the master/slaves.
>> I launch the task just as you did, but the tasks are not getting
>> scheduled.
>>
>>
>> On the master the logs are the same as I sent you before:
>>
>> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>> On both of the slaves I can see the below logs
>>
>> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
>> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
>> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%. Max
>> allowed age: 6.047984349521910days
>> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
>> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
>> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
>> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
>> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
>> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
>> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
>> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
>> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
>> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
>> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
>> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
>> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>>
>>
>>
>> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> From your log, it seems that the master process is exiting, which causes
>>> the framework to fail over to another mesos master. Can you please give more
>>> detail on the steps to reproduce your issue?
>>>
>>> I did a test by running mesos-execute on a client host which does not
>>> run any mesos service, and the task was scheduled fine.
>>>
>>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10"
>>> --resources="cpus(*):1;mem(*):256"
>>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>>> master@192.168.0.107:5050
>>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>>> Attempting to register without authentication
>>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>>> task cluster-test submitted to slave
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>>> Received status update TASK_RUNNING for task cluster-test
>>> Received status update TASK_FINISHED for task cluster-test
>>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>>
>>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> I am facing one more issue. If I try to schedule the tasks from an
>>>> external client system running the same mesos-execute CLI, the tasks
>>>> are not launched. The requests reach the Master, which just drops
>>>> them. Below are the related logs:
>>>>
>>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework
>>>> with checkpointing disabled and capabilities [ ]
>>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> disconnected
>>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
>>>> with fd 14: Transport endpoint is not connected
>>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
>>>> failover
>>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
>>>> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
>>>> framework has terminated or is inactive
>>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover
>>>> timeout, removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>>
>>>>
>>>> Can you please tell me what the reason is? The client is in the same
>>>> network as well, but it does not run any master or slave processes.
>>>>
>>>> Thanks & Regards,
>>>> Pradeep
>>>>
>>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Glad it finally works! I am not sure whether you are using
>>>>> systemd.slice, but are you running into this issue:
>>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>>
>>>>> Hope Jie Yu can give you some help on this ;-)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>>
>>>>>> Thanks for sharing the information.
>>>>>>
>>>>>> Now I can launch the tasks. The problem was with permissions: if
>>>>>> I start all the slaves and the Master as root, it works fine.
>>>>>> Otherwise I have problems launching the tasks.
>>>>>>
>>>>>> But on one of the slaves I could not launch the slave as root; I am
>>>>>> facing the following issue.
>>>>>>
>>>>>> Failed to create a containerizer: Could not create
>>>>>> MesosContainerizer: Failed to create launcher: Failed to create Linux
>>>>>> launcher: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/freezer':
>>>>>> 'freezer' is already attached to another hierarchy
>>>>>>
>>>>>> I took that slave out of the cluster for now. The tasks are getting
>>>>>> scheduled on the other two slave nodes.
>>>>>>
>>>>>> Thanks for your timely help
>>>>>>
>>>>>> -Pradeep
>>>>>>
>>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> My steps were pretty simple, just as in
>>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>>
>>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>>> On each of the 3 Slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>>
>>>>>>> Then schedule a task from any of the nodes. Here I was using slave node
>>>>>>> mesos007; you can see that the two tasks were launched on different hosts.
>>>>>>>
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>>> master@192.168.0.107:5050
>>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>> ^C
>>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>>> master@192.168.0.107:5050
>>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>>>>> Attempting to register without authentication
>>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>>> task cluster-test submitted to slave
>>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> I just want to know how you launched the tasks.
>>>>>>>>
>>>>>>>> 1. What processes have you started on the Master?
>>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>>
>>>>>>>> I am missing something here, because all my slaves have enough
>>>>>>>> memory and CPUs to launch the tasks I mentioned.
>>>>>>>> I am probably missing some configuration steps.
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> I did some tests with your case and found that the task can run
>>>>>>>>> on any of the three slave hosts at random; each run may give a
>>>>>>>>> different result. The logic is here:
>>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>>> The allocator randomly shuffles the slaves each time it
>>>>>>>>> allocates resources for offers.
>>>>>>>>>
>>>>>>>>> I see that each of your tasks needs at least
>>>>>>>>> --resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>>> your slaves have enough resources? If you want your task to run on
>>>>>>>>> other slaves, those slaves need at least 3 cpus and 2560 MB of memory.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ondrej,
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply
>>>>>>>>>>
>>>>>>>>>> I did solve that issue. Yes, you are right: there was an issue with
>>>>>>>>>> the slave IP address setting.
>>>>>>>>>>
>>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try
>>>>>>>>>> to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>>> I just start the mesos slaves like below
>>>>>>>>>>
>>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>>> --hostname=slave1
>>>>>>>>>>
>>>>>>>>>> If I submit the task using the above mesos-execute command from
>>>>>>>>>> one of the slaves, it runs on that system.
>>>>>>>>>>
>>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>>> just that system and queues the tasks; it does not run them on
>>>>>>>>>> the other slaves.
>>>>>>>>>> Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>>
>>>>>>>>>> Do I need to start some process to push the tasks to all the
>>>>>>>>>> slaves equally? Am I missing something here?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> the problem is with the IP your slave advertises. By default, Mesos
>>>>>>>>>>> resolves your hostname. There are several solutions (let's say your
>>>>>>>>>>> node IP is 192.168.56.128):
>>>>>>>>>>>
>>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>>> 2) set the mesos options ip and hostname
>>>>>>>>>>>
>>>>>>>>>>> One way to do this is to create the files
>>>>>>>>>>>
>>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>>
>>>>>>>>>>> for more configuration options see
>>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>>
>>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply. I found one interesting log message.
>>>>>>>>>>>>
>>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>>> registered at the same address
>>>>>>>>>>>>
>>>>>>>>>>>> Most likely because of this, the slave nodes keep getting
>>>>>>>>>>>> registered and de-registered to make room for the next node. I
>>>>>>>>>>>> can even see this in the UI: a node is added for some time and
>>>>>>>>>>>> is then replaced by the new slave node.
>>>>>>>>>>>>
>>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting
>>>>>>>>>>>> slave 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to
>>>>>>>>>>>> shutdown socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action
>>>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action
>>>>>>>>>>>> at 384
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guangya
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1
>>>>>>>>>>>>>> Master and 3 Slaves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run
>>>>>>>>>>>>>> on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves)
>>>>>>>>>>>>>> and try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>>> The other nodes resources are not visible. Some times
>>>>>>>>>>>>>> visible but in a de-actived state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tasks always get scheduled on the same node. The
>>>>>>>>>>>>>> resources from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on your previous question, there is only one node in
>>>>>>>>>>>>> your cluster; that's why the other nodes are not available. We first
>>>>>>>>>>>>> need to identify what is wrong with the other three nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Is it required to register the frameworks from every slave
>>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
I think the problem might be caused by running the lxc container on the
master node; I am not sure whether there is a port conflict or something
else wrong.
In my case, I was running the client on a new node, not on the master node;
perhaps you can try putting your client on a new node instead of the
master node.
Thanks,
Guangya
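A possible check from the client side (a sketch, not a confirmed diagnosis): the master log quoted further down shows the scheduler registering as scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259, a loopback address that other hosts cannot use to reach it. Assuming the client advertises whatever its hostname resolves to, the snippet below tests for that case and points at the LIBPROCESS_IP variable mentioned earlier in this thread. The helper names is_loopback and resolved_ip are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the address is in 127.0.0.0/8.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Hypothetical helper: first address the local hostname resolves to
# (prints nothing if the name cannot be resolved).
resolved_ip() {
  getent hosts "$(uname -n)" 2>/dev/null | awk '{ print $1; exit }'
}

ip=$(resolved_ip)
if is_loopback "$ip"; then
  echo "hostname resolves to $ip, which other hosts cannot reach"
  echo "consider: export LIBPROCESS_IP=<this client's address on the cluster network>"
fi
```

If the check fires, exporting LIBPROCESS_IP in the same shell before running mesos-execute would be the first thing to try.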
On Mon, Oct 5, 2015 at 7:30 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Guangya,
>
> Hmm, that is strange; it behaves differently in my case.
>
> If I run mesos-execute on the master or on one of the slave nodes, the
> tasks get their resources and are scheduled fine.
> But if I start mesos-execute on another node that is neither a slave
> nor the master, I have this issue.
>
> I am using an lxc container on the master as a client to launch the tasks.
> It is on the same network as the master and slaves.
> I launch the task just as you did, but the tasks are not getting
> scheduled.
>
>
> On the master, the logs are the same as the ones I sent you before:
>
> Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>
> On both slaves I can see the logs below:
>
> I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
> W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
> I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%. Max
> allowed age: 6.047984349521910days
> I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
> W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
> I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
> W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
> I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
> W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
> I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
> W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
> I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
> W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
> I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
> W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
> framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
>
>
>
> On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> From your log, it seems that the master process is exiting, which caused
>> the framework to fail over to another Mesos master. Can you please share
>> the detailed steps to reproduce your issue?
>>
>> I ran a test with mesos-execute on a client host that does not run
>> any Mesos service, and the task was scheduled fine.
>>
>> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>> --master=192.168.0.107:5050 --name="cluster-test"
>> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
>> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
>> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
>> master@192.168.0.107:5050
>> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
>> Attempting to register without authentication
>> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
>> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
>> task cluster-test submitted to slave
>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
>> Received status update TASK_RUNNING for task cluster-test
>> Received status update TASK_FINISHED for task cluster-test
>> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
>> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
>> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
>> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
>> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>>
>> Thanks,
>>
>> Guangya
>>
>>
>> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> I am facing one more issue. If I try to schedule the tasks from an
>>> external client system running the same mesos-execute CLI, the tasks
>>> are not launched. The requests reach the Master, which just drops
>>> them. Below are the related logs:
>>>
>>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with
>>> checkpointing disabled and capabilities [ ]
>>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
>>> with fd 14: Transport endpoint is not connected
>>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>> disconnected
>>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
>>> with fd 14: Transport endpoint is not connected
>>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
>>> failover
>>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
>>> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
>>> framework has terminated or is inactive
>>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
>>> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
>>> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
>>> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout,
>>> removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>>
>>>
>>> Can you please tell me what the reason is? The client is on the same
>>> network as well, but it does not run any master or slave processes.
>>>
>>> Thanks & Regards,
>>> Pradeep
>>>
>>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> Glad it finally works! I am not sure whether you are using systemd.slice,
>>>> but are you running into this issue:
>>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>>
>>>> Hope Jie Yu can give you some help on this ;-)
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>>
>>>>> Thanks for sharing the information.
>>>>>
>>>>> Now I can launch the tasks. The problem was with permissions. If
>>>>> I start the Master and all the slaves as root, it works fine.
>>>>> Otherwise I have problems launching the tasks.
>>>>>
>>>>> But on one of the nodes I could not launch the slave as root; I am
>>>>> facing the following issue.
>>>>>
>>>>> Failed to create a containerizer: Could not create MesosContainerizer:
>>>>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>>>>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>>>>> attached to another hierarchy
>>>>>
>>>>> I took that out from the cluster for now. The tasks are getting
>>>>> scheduled on the other two slave nodes.
>>>>>
>>>>> Thanks for your timely help
>>>>>
>>>>> -Pradeep
>>>>>
>>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> My steps were pretty simple, just as in
>>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>>
>>>>>> On the Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>>> On each of the 3 Slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>>
>>>>>> Then schedule a task from any of the nodes; here I was using slave node
>>>>>> mesos007. You can see that the two tasks were launched on different hosts.
>>>>>>
>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>>> master@192.168.0.107:5050
>>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>>>> Attempting to register without authentication
>>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>>> task cluster-test submitted to slave
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>> ^C
>>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>>> master@192.168.0.107:5050
>>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>>>> Attempting to register without authentication
>>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>>> task cluster-test submitted to slave
>>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> I just want to know how you launched the tasks.
>>>>>>>
>>>>>>> 1. What processes have you started on the Master?
>>>>>>> 2. What processes have you started on the Slaves?
>>>>>>>
>>>>>>> All my slaves have enough memory and cpus to launch the tasks I
>>>>>>> mentioned, so I must be missing something here, probably some
>>>>>>> configuration steps.
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>>
>>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> I ran some tests with your case and found that the task can run
>>>>>>>> on any of the three slave hosts; the result may differ every time.
>>>>>>>> The logic is here:
>>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>>> The allocator randomly shuffles the slaves every time it
>>>>>>>> allocates resources for offers.
>>>>>>>>
>>>>>>>> I see that each of your tasks needs at least
>>>>>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>>> your slaves have enough resources? If you want your task to run on other
>>>>>>>> slaves, those slaves need to have at least 3 cpus and 2560M memory.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Ondrej,
>>>>>>>>>
>>>>>>>>> Thanks for your reply
>>>>>>>>>
>>>>>>>>> I did solve that issue; yes, you are right, there was an issue with
>>>>>>>>> the slave IP address setting.
>>>>>>>>>
>>>>>>>>> Now I am facing an issue with scheduling the tasks. When I try to
>>>>>>>>> schedule a task using
>>>>>>>>>
>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>
>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>
>>>>>>>>> I just start the mesos slaves like below
>>>>>>>>>
>>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>>> --hostname=slave1
>>>>>>>>>
>>>>>>>>> If I submit the task using the above (mesos-execute) command from
>>>>>>>>> one of the slaves, it runs on that system.
>>>>>>>>>
>>>>>>>>> But when I submit the task from a different system, it uses
>>>>>>>>> just that system and queues the tasks; they do not run on the other
>>>>>>>>> slaves. Sometimes I see the message "Failed to getgid: unknown user".
>>>>>>>>>
>>>>>>>>> Do I need to start some process to push the tasks to all the slaves
>>>>>>>>> equally? Am I missing something here?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Pradeep
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> the problem is with the IP your slave advertises - mesos by default
>>>>>>>>>> resolves your hostname - there are several solutions (let's say your
>>>>>>>>>> node IP is 192.168.56.128)
>>>>>>>>>>
>>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>>>>
>>>>>>>>>> one way to do this is to create files
>>>>>>>>>>
>>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>>
>>>>>>>>>> for more configuration options see
>>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Guangya,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>>
>>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>>> registered at the same address
>>>>>>>>>>>
>>>>>>>>>>> Most likely because of this, the slave nodes keep getting
>>>>>>>>>>> registered and de-registered to make room for the next node. I
>>>>>>>>>>> can even see this in the UI: a node is added for some time and
>>>>>>>>>>> is then replaced by the new slave node.
>>>>>>>>>>>
>>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action
>>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at
>>>>>>>>>>> 384
>>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>>> learned notice for position 384
>>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action
>>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at
>>>>>>>>>>> 384
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>>
>>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Guangya
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master
>>>>>>>>>>>>> and 3 Slaves.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run
>>>>>>>>>>>>> on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and
>>>>>>>>>>>>> try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>>> The other nodes resources are not visible. Some times visible
>>>>>>>>>>>>> but in a de-actived state.
>>>>>>>>>>>>>
>>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes
>>>>>>>>>>>>> are in the same network. *
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>>
>>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>>
>>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>>
>>>>>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>>>>>> cluster; that's why the other nodes are not available. We first need
>>>>>>>>>>>> to identify what is wrong with the other three nodes.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Is it required to register the frameworks from every slave
>>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>>
>>>>>>>>>>>> It is not required.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>>
>>>>>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
Hmm, that is strange; it behaves differently in my case.
If I run mesos-execute on the master or on one of the slave nodes, the
tasks get their resources and are scheduled fine.
But if I start mesos-execute on another node that is neither a slave
nor the master, I have this issue.
I am using an lxc container on the master as a client to launch the tasks.
It is on the same network as the master and slaves.
I launch the task just as you did, but the tasks are not getting
scheduled.
On the master, the logs are the same as the ones I sent you before:
Deactivating framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
On both slaves I can see the logs below:
I1005 13:23:32.547987 4831 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0060 by master@192.168.0.102:5050
W1005 13:23:32.548135 4831 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0060
I1005 13:23:33.697707 4833 slave.cpp:3926] Current disk usage 3.60%. Max
allowed age: 6.047984349521910days
I1005 13:23:34.098599 4829 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0061 by master@192.168.0.102:5050
W1005 13:23:34.098740 4829 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0061
I1005 13:23:35.274569 4831 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0062 by master@192.168.0.102:5050
W1005 13:23:35.274683 4831 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0062
I1005 13:23:36.193964 4829 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0063 by master@192.168.0.102:5050
W1005 13:23:36.194090 4829 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0063
I1005 13:24:01.914788 4827 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0064 by master@192.168.0.102:5050
W1005 13:24:01.914937 4827 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0064
I1005 13:24:03.469974 4833 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0065 by master@192.168.0.102:5050
W1005 13:24:03.470118 4833 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0065
I1005 13:24:04.642654 4826 slave.cpp:1980] Asked to shut down framework
77539063-89ce-4efa-a20b-ca788abbd912-0066 by master@192.168.0.102:5050
W1005 13:24:04.642812 4826 slave.cpp:1995] Cannot shut down unknown
framework 77539063-89ce-4efa-a20b-ca788abbd912-0066
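One way to narrow this down from the logs, sketched here as an assumption rather than a confirmed cause: the master log quoted below shows the framework connecting from scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259, and the earlier slave registrations showed slave(1)@127.0.1.1:5051. Any endpoint that registers with a 127.x address is worth a look. The function name loopback_endpoints is made up for illustration; it relies on grep -o, which GNU, BSD and busybox grep provide.

```shell
#!/bin/sh
# Hypothetical helper: read a Mesos log on stdin and print the unique
# "name@ip:port" endpoints whose IP is in 127.0.0.0/8.
loopback_endpoints() {
  grep -Eo '[A-Za-z0-9()_-]+@127\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' | sort -u
}

# Example input: one line from the master log in this thread, with the
# mail line-wrapping undone.
sample='I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259'

printf '%s\n' "$sample" | loopback_endpoints
```

For the sample line this prints the scheduler endpoint. Against a real log file the same function can be fed with input redirection; endpoints such as master@192.168.0.102:5050 are not reported because their address is not a loopback one.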
On 5 October 2015 at 13:09, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> From your log, it seems that the master process is exiting, which caused
> the framework to fail over to another Mesos master. Can you please share
> the detailed steps to reproduce your issue?
>
> I ran a test with mesos-execute on a client host that does not run
> any Mesos service, and the task was scheduled fine.
>
> root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute
> --master=192.168.0.107:5050 --name="cluster-test"
> --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
> I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
> I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at
> master@192.168.0.107:5050
> I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided.
> Attempting to register without authentication
> I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with
> 04b9af5e-e9b6-4c59-8734-eba407163922-0002
> Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
> task cluster-test submitted to slave
> c0e5fdde-595e-4768-9d04-25901d4523b6-S0
> Received status update TASK_RUNNING for task cluster-test
> Received status update TASK_FINISHED for task cluster-test
> I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
> I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework
> '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
> root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
> root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
>
> Thanks,
>
> Guangya
>
>
> On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> I am facing one more issue. If I try to schedule the tasks from an
>> external client system running the same mesos-execute CLI, the tasks
>> are not launched. The requests reach the Master, which just drops
>> them. Below are the related logs:
>>
>> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with
>> checkpointing disabled and capabilities [ ]
>> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
>> with fd 14: Transport endpoint is not connected
>> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>> disconnected
>> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
>> with fd 14: Transport endpoint is not connected
>> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
>> failover
>> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
>> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
>> framework has terminated or is inactive
>> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
>> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
>> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
>> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
>> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055
>> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout,
>> removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
>> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
>> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>>
>>
>> Can you please tell me what is the reason? The client is in the same
>> network as well. But it does not run any master or slave processes.
>>
>> Thanks & Regards,
>> Pradeep
>>
>> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> Glad it finally works! I am not sure whether you are using systemd.slice,
>>> but are you running into this issue:
>>> https://issues.apache.org/jira/browse/MESOS-1195
>>>
>>> Hope Jie Yu can give you some help on this ;-)
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>>
>>>> Thanks for sharing the information.
>>>>
>>>> Now I could launch the tasks. The problem was with the permission. If I
>>>> start all the slaves and Master as root it works fine.
>>>> Else I have problem with launching the tasks.
>>>>
>>>> But on one of the slave I could not launch the slave as root, I am
>>>> facing the following issue.
>>>>
>>>> Failed to create a containerizer: Could not create MesosContainerizer:
>>>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>>>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>>>> attached to another hierarchy
>>>>
>>>> I took that out from the cluster for now. The tasks are getting
>>>> scheduled on the other two slave nodes.
>>>>
>>>> Thanks for your timely help
>>>>
>>>> -Pradeep
>>>>
>>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> My steps were pretty simple, just as in
>>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>>
>>>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>>
>>>>> Then schedule a task from any of the nodes. Here I used slave node
>>>>> mesos007; you can see that the two tasks were launched on different hosts.
>>>>>
>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>>> master@192.168.0.107:5050
>>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>>> Attempting to register without authentication
>>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>>> task cluster-test submitted to slave
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>> ^C
>>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>>> master@192.168.0.107:5050
>>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>>> Attempting to register without authentication
>>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>>> task cluster-test submitted to slave
>>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>>> Received status update TASK_RUNNING for task cluster-test
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I just want to know how did you launch the tasks.
>>>>>>
>>>>>> 1. What processes you have started on Master?
>>>>>> 2. What are the processes you have started on Slaves?
>>>>>>
>>>>>> I am missing something here, otherwise all my slave have enough
>>>>>> memory and cpus to launch the tasks I mentioned.
>>>>>> What I am missing is some configuration steps.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
>>>>>>
>>>>>>
>>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> I ran some tests with your case and found that the task can land on any
>>>>>>> of the three slave hosts; the result may differ from run to run.
>>>>>>> The logic is here:
>>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>>> The allocator randomly shuffles the slaves each time it allocates
>>>>>>> resources for offers.
>>>>>>>
>>>>>>> I see that each of your tasks requires at least
>>>>>>> --resources="cpus(*):3;mem(*):2560". Can you check whether all of
>>>>>>> your slaves have enough resources? If you want your task to run on other
>>>>>>> slaves, those slaves need at least 3 cpus and 2560 MB of memory.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ondrej,
>>>>>>>>
>>>>>>>> Thanks for your reply
>>>>>>>>
>>>>>>>> I did solve that issue, yes you are right there was an issue with
>>>>>>>> slave IP address setting.
>>>>>>>>
>>>>>>>> Now I am facing issue with the scheduling the tasks. When I try to
>>>>>>>> schedule a task using
>>>>>>>>
>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>
>>>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>>
>>>>>>>> I just start the mesos slaves like below
>>>>>>>>
>>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>>> --hostname=slave1
>>>>>>>>
>>>>>>>> If I submit the task using the above (mesos-execute) command from
>>>>>>>> same as one of the slave it runs on that system.
>>>>>>>>
>>>>>>>> But when I submit the task from some different system. It uses just
>>>>>>>> that system and queues the tasks not runs on the other slaves.
>>>>>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>>>>>
>>>>>>>> Do I need to start some process to push the task on all the slaves
>>>>>>>> equally? Am I missing something here?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> the problem is with IP your slave advertise - mesos by default
>>>>>>>>> resolves your hostname - there are several solutions (let say your node ip
>>>>>>>>> is 192.168.56.128)
>>>>>>>>>
>>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>>>
>>>>>>>>> one way to do this is to create files
>>>>>>>>>
>>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>>
>>>>>>>>> for more configuration options see
>>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi Guangya,
>>>>>>>>>>
>>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>>
>>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>>> registered at the same address
>>>>>>>>>>
>>>>>>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>>>>>>> registered and de-registered to make a room for the next node. I can even
>>>>>>>>>> see this on
>>>>>>>>>> the UI interface, for some time one node got added and after some
>>>>>>>>>> time that will be replaced with the new slave node.
>>>>>>>>>>
>>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action
>>>>>>>>>> (18 bytes) to leveldb took 104089ns
>>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at
>>>>>>>>>> 384
>>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>>> learned notice for position 384
>>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action
>>>>>>>>>> (20 bytes) to leveldb took 95171ns
>>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys
>>>>>>>>>> from leveldb took 20333ns
>>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at
>>>>>>>>>> 384
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>>
>>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Guangya
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master
>>>>>>>>>>>> and 3 Slaves.
>>>>>>>>>>>>
>>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run
>>>>>>>>>>>> on different nodes. Here node means the physical boxes.
>>>>>>>>>>>>
>>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>>
>>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and
>>>>>>>>>>>> try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>>>> resources are visible.
>>>>>>>>>>>> The other nodes resources are not visible. Some times visible
>>>>>>>>>>>> but in a de-actived state.
>>>>>>>>>>>>
>>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>>> you what is wrong.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Please let me know what could be the reason. All the nodes are
>>>>>>>>>>>> in the same network. *
>>>>>>>>>>>>
>>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>>
>>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>>
>>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>>
>>>>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>>>>>>> what is wrong with other three nodes first.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I*s it required to register the frameworks from every slave
>>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>>
>>>>>>>>>>> It is not required.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Pradeep
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
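The resource check Guangya describes in the quoted reply above (a slave is only usable for this task if it offers at least cpus:3 and mem:2560) can be sketched roughly as below. This is only an illustration, not how the Mesos allocator or mesos-execute is implemented. The function names resource_value and fits are made up for this example, and only whole-number cpus and mem values are handled.

```shell
# Extract the integer value of one named resource (e.g. cpus or mem) from a
# Mesos-style resource string such as "cpus(*):8; mem(*):14930".
# Hypothetical helper; prints nothing if the resource is absent.
resource_value() {
  printf '%s\n' "$2" | tr ';' '\n' | sed -n "s/^ *$1(\*):\([0-9][0-9]*\)\$/\1/p"
}

# Succeed (exit 0) if the offered resources cover the requested cpus and mem.
fits() {
  need=$1
  offer=$2
  for name in cpus mem; do
    want=$(resource_value "$name" "$need")
    have=$(resource_value "$name" "$offer")
    [ -n "$want" ] || continue            # resource not requested, skip it
    [ "${have:-0}" -ge "$want" ] || return 1
  done
  return 0
}

# Request from the thread, checked against a slave as reported in the master log.
if fits "cpus(*):3;mem(*):2560" "cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]"; then
  echo "slave can host the task"
else
  echo "slave is too small for the task"
fi
```

Going by the slave sizes shown in the master log (8 cpus and about 14 GB each), the slaves in this thread pass this check, which suggests that resource size was not the limiting factor here.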
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
From your log, it seems that the master process is exiting, which caused the
framework to fail over to another Mesos master. Can you please share more
detailed steps to reproduce the issue?
I ran a test with mesos-execute on a client host that does not run any
Mesos service, and the task was scheduled fine.
root@mesos008:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 10" --resources="cpus(*):1;mem(*):256"
I1005 18:59:47.974123 1233 sched.cpp:164] Version: 0.26.0
I1005 18:59:47.990890 1248 sched.cpp:262] New master detected at master@192.168.0.107:5050
I1005 18:59:47.993074 1248 sched.cpp:272] No credentials provided. Attempting to register without authentication
I1005 18:59:48.001194 1249 sched.cpp:641] Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
Framework registered with 04b9af5e-e9b6-4c59-8734-eba407163922-0002
task cluster-test submitted to slave c0e5fdde-595e-4768-9d04-25901d4523b6-S0
Received status update TASK_RUNNING for task cluster-test
Received status update TASK_FINISHED for task cluster-test
I1005 18:59:58.431144 1249 sched.cpp:1771] Asked to stop the driver
I1005 18:59:58.431591 1249 sched.cpp:1040] Stopping framework '04b9af5e-e9b6-4c59-8734-eba407163922-0002'
root@mesos008:~/src/mesos/m1/mesos/build# ps -ef | grep mesos
root 1259 1159 0 19:06 pts/0 00:00:00 grep --color=auto mesos
Thanks,
Guangya
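One detail in the master log quoted below may be worth checking: the scheduler shows up as ...@127.0.1.1:47259, which is a loopback address, so the master may be unable to connect back to the client. This looks like the same hostname-resolution behaviour Ondrej described for slaves earlier in the thread. Here is a minimal sketch of a check to run on the client before mesos-execute. It assumes getent is available and that libprocess advertises whatever address the hostname resolves to unless LIBPROCESS_IP is set. The function names are made up for illustration.

```shell
# Succeed (exit 0) if the IPv4 address is in 127.0.0.0/8, i.e. loopback.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Print the first address this host's own name resolves to (empty if unknown).
resolved_ip() {
  getent hosts "$(hostname)" 2>/dev/null | awk '{ print $1; exit }'
}

ip=$(resolved_ip)
if is_loopback "$ip"; then
  echo "hostname resolves to $ip (loopback)"
  echo "try: export LIBPROCESS_IP=<routable IP of this client> before running mesos-execute"
fi
```

If this reports a loopback address, exporting LIBPROCESS_IP as in Ondrej's suggestion, or correcting the hostname entry in /etc/hosts, would be the first thing to try.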
On Mon, Oct 5, 2015 at 6:50 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Guangya,
>
> I am facing one more issue. If I try to schedule tasks from an external
> client system using the same mesos-execute CLI, the tasks are not launched.
> The requests reach the Master, which just drops them. The related logs are
> below:
>
> I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with
> checkpointing disabled and capabilities [ ]
> E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055
> I1005 11:33:35.026298 21369 master.cpp:1119] Framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
> disconnected
> I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
> I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
> E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
> with fd 14: Transport endpoint is not connected
> I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
> failover
> I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055
> W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
> offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
> framework has terminated or is inactive
> I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
> mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
> mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
> 77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055
> I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
> mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
> 77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055
> I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout,
> removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
> I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
> 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
> scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
>
>
> Can you please tell me what is the reason? The client is in the same
> network as well. But it does not run any master or slave processes.
>
> Thanks & Regards,
> Pradeep
>
> On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> Glad it finally works! I am not sure whether you are using systemd.slice,
>> but are you running into this issue:
>> https://issues.apache.org/jira/browse/MESOS-1195
>>
>> Hope Jie Yu can give you some help on this ;-)
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>>
>>> Thanks for sharing the information.
>>>
>>> Now I could launch the tasks. The problem was with the permission. If I
>>> start all the slaves and Master as root it works fine.
>>> Else I have problem with launching the tasks.
>>>
>>> But on one of the slave I could not launch the slave as root, I am
>>> facing the following issue.
>>>
>>> Failed to create a containerizer: Could not create MesosContainerizer:
>>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>>> attached to another hierarchy
>>>
>>> I took that out from the cluster for now. The tasks are getting
>>> scheduled on the other two slave nodes.
>>>
>>> Thanks for your timely help
>>>
>>> -Pradeep
>>>
>>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> My steps were pretty simple, just as in
>>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>>
>>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>>
>>>> Then schedule a task from any of the nodes. Here I used slave node
>>>> mesos007; you can see that the two tasks were launched on different hosts.
>>>>
>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>>> master@192.168.0.107:5050
>>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>>> Attempting to register without authentication
>>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>>> task cluster-test submitted to slave
>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>>> Received status update TASK_RUNNING for task cluster-test
>>>> ^C
>>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute
>>>> --master=192.168.0.107:5050 --name="cluster-test"
>>>> --command="/bin/sleep 100" --resources="cpus(*):1;mem(*):256"
>>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>>> master@192.168.0.107:5050
>>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>>> Attempting to register without authentication
>>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>>> task cluster-test submitted to slave
>>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>>> Received status update TASK_RUNNING for task cluster-test
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I just want to know how did you launch the tasks.
>>>>>
>>>>> 1. What processes you have started on Master?
>>>>> 2. What are the processes you have started on Slaves?
>>>>>
>>>>> I am missing something here, otherwise all my slave have enough memory
>>>>> and cpus to launch the tasks I mentioned.
>>>>> What I am missing is some configuration steps.
>>>>>
>>>>> Thanks & Regards,
>>>>> Pradeep
>>>>>
>>>>>
>>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> I ran some tests with your case and found that the task can land on any
>>>>>> of the three slave hosts; the result may differ from run to run.
>>>>>> The logic is here:
>>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>>> The allocator randomly shuffles the slaves each time it allocates
>>>>>> resources for offers.
>>>>>>
>>>>>> I see that each of your tasks requires at least
>>>>>> --resources="cpus(*):3;mem(*):2560". Can you check whether all of your
>>>>>> slaves have enough resources? If you want your task to run on other slaves,
>>>>>> those slaves need at least 3 cpus and 2560 MB of memory.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ondrej,
>>>>>>>
>>>>>>> Thanks for your reply
>>>>>>>
>>>>>>> I did solve that issue, yes you are right there was an issue with
>>>>>>> slave IP address setting.
>>>>>>>
>>>>>>> Now I am facing issue with the scheduling the tasks. When I try to
>>>>>>> schedule a task using
>>>>>>>
>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>
>>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>
>>>>>>> I just start the mesos slaves like below
>>>>>>>
>>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>>> --hostname=slave1
>>>>>>>
>>>>>>> If I submit the task using the above (mesos-execute) command from
>>>>>>> same as one of the slave it runs on that system.
>>>>>>>
>>>>>>> But when I submit the task from some different system. It uses just
>>>>>>> that system and queues the tasks not runs on the other slaves.
>>>>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>>>>
>>>>>>> Do I need to start some process to push the task on all the slaves
>>>>>>> equally? Am I missing something here?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> the problem is with IP your slave advertise - mesos by default
>>>>>>>> resolves your hostname - there are several solutions (let say your node ip
>>>>>>>> is 192.168.56.128)
>>>>>>>>
>>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>>
>>>>>>>> one way to do this is to create files
>>>>>>>>
>>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>>
>>>>>>>> for more configuration options see
>>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi Guangya,
>>>>>>>>>
>>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>>
>>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>>> registered at the same address
>>>>>>>>>
>>>>>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>>>>>> registered and de-registered to make a room for the next node. I can even
>>>>>>>>> see this on
>>>>>>>>> the UI interface, for some time one node got added and after some
>>>>>>>>> time that will be replaced with the new slave node.
>>>>>>>>>
>>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>>>>>> bytes) to leveldb took 104089ns
>>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at
>>>>>>>>> 384
>>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) with cpus(*):8; mem(*):14930;
>>>>>>>>> disk(*):218578; ports(*):[31000-32000]
>>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>> 127.0.1.1:5051 (192.168.0.116) disconnected
>>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@
>>>>>>>>> 127.0.1.1:5051 (192.168.0.116)
>>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>>> learned notice for position 384
>>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>>>>>> bytes) to leveldb took 95171ns
>>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>>>>>> leveldb took 20333ns
>>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at
>>>>>>>>> 384
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Pradeep
>>>>>>>>>
>>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Pradeep,
>>>>>>>>>>
>>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Guangya
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master
>>>>>>>>>>> and 3 Slaves.
>>>>>>>>>>>
>>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>>>>>
>>>>>>>>>>> I tried running the tasks by configuring one Node cluster.
>>>>>>>>>>> Tested the task scheduling using mesos-execute, works fine.
>>>>>>>>>>>
>>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and
>>>>>>>>>>> try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>>> resources are visible.
>>>>>>>>>>> The other nodes resources are not visible. Some times visible
>>>>>>>>>>> but in a de-actived state.
>>>>>>>>>>>
>>>>>>>>>> Can you please append some logs from mesos-slave and
>>>>>>>>>> mesos-master? There should be some logs in either master or slave telling
>>>>>>>>>> you what is wrong.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Please let me know what could be the reason. All the nodes are
>>>>>>>>>>> in the same network. *
>>>>>>>>>>>
>>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>>
>>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>>
>>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>>
>>>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>>>>>> what is wrong with other three nodes first.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I*s it required to register the frameworks from every slave
>>>>>>>>>>> node on the Master?*
>>>>>>>>>>>
>>>>>>>>>> It is not required.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Pradeep
>>>>>>>>>>>
>>>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
I am facing one more issue. When I try to schedule tasks from an external
client system using the same mesos-execute CLI, the tasks are not launched.
The requests reach the Master, which just drops them. The related logs are
below:
I1005 11:33:35.025594 21369 master.cpp:2250] Subscribing framework with
checkpointing disabled and capabilities [ ]
E1005 11:33:35.026100 21373 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1005 11:33:35.026129 21372 hierarchical.hpp:515] Added framework
77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.026298 21369 master.cpp:1119] Framework
77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 disconnected
I1005 11:33:35.026329 21369 master.cpp:2475] Disconnecting framework
77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
I1005 11:33:35.026340 21369 master.cpp:2499] Deactivating framework
77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
E1005 11:33:35.026345 21373 process.cpp:1912] Failed to shutdown socket
with fd 14: Transport endpoint is not connected
I1005 11:33:35.026376 21369 master.cpp:1143] Giving framework
77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 0ns to
failover
I1005 11:33:35.026743 21372 hierarchical.hpp:599] Deactivated framework
77539063-89ce-4efa-a20b-ca788abbd912-0055
W1005 11:33:35.026757 21368 master.cpp:4828] Master returning resources
offered to framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 because the
framework has terminated or is inactive
I1005 11:33:35.027014 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
mem(*):14868; disk(*):218835; ports(*):[31000-32000] (total: cpus(*):8;
mem(*):14868; disk(*):218835; ports(*):[31000-32000], allocated: ) on slave
77539063-89ce-4efa-a20b-ca788abbd912-S2 from framework
77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027159 21371 hierarchical.hpp:1103] Recovered cpus(*):8;
mem(*):14930; disk(*):218578; ports(*):[31000-32000] (total: cpus(*):8;
mem(*):14930; disk(*):218578; ports(*):[31000-32000], allocated: ) on slave
77539063-89ce-4efa-a20b-ca788abbd912-S1 from framework
77539063-89ce-4efa-a20b-ca788abbd912-0055
I1005 11:33:35.027668 21366 master.cpp:4815] Framework failover timeout,
removing framework 77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
I1005 11:33:35.027715 21366 master.cpp:5571] Removing framework
77539063-89ce-4efa-a20b-ca788abbd912-0055 () at
scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259
Can you please tell me what the reason is? The client is in the same
network as well, but it does not run any master or slave processes.
Thanks & Regards,
Pradeep
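A note on the log above: the framework's scheduler registers from
127.0.1.1:47259, a loopback address, much like the slave(1)@127.0.1.1:5051
entries earlier in this thread. That may mean the client advertises an address
the master cannot connect back to, in which case the LIBPROCESS_IP setting
Ondrej suggested for the slaves might also help on the client. This is a guess
from the log, not a confirmed diagnosis. A minimal Python sketch for spotting
such endpoints in a master log follows; the function name and the regular
expression are illustrative assumptions based on the lines quoted here, not
part of Mesos.

```python
import ipaddress
import re

# Matches libprocess-style endpoints such as
# "scheduler-<uuid>@127.0.1.1:47259" or "slave(1)@127.0.1.1:5051".
# The pattern is an assumption derived from the log lines in this thread,
# not an official Mesos log format.
ENDPOINT_RE = re.compile(r"(\S+)@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def find_loopback_endpoints(log_text):
    """Return sorted unique (name, ip, port) tuples whose IP is a loopback address."""
    hits = set()
    for name, ip, port in ENDPOINT_RE.findall(log_text):
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # looks like an IP but is not a valid one
        if addr.is_loopback:
            hits.add((name, ip, int(port)))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "I1005 11:33:35.026298 21369 master.cpp:1119] Framework "
        "77539063-89ce-4efa-a20b-ca788abbd912-0055 () at "
        "scheduler-b1bc0243-b5be-44ae-894c-ca318c24ce6d@127.0.1.1:47259 "
        "disconnected\n"
        "I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at "
        "master@192.168.0.107:5050\n"
    )
    for name, ip, port in find_loopback_endpoints(sample):
        print(f"{name} advertises loopback address {ip}:{port}")
```

Running this against a saved master log lists every scheduler or slave that
registered from a 127.x.x.x address, which are the candidates for an explicit
IP setting.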
On 5 October 2015 at 12:13, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> Glad it finally works! Not sure if you are using systemd.slice or not, are
> you running to this issue:
> https://issues.apache.org/jira/browse/MESOS-1195
>
> Hope Jie Yu can give you some help on this ;-)
>
> Thanks,
>
> Guangya
>
> On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>>
>> Thanks for sharing the information.
>>
>> Now I could launch the tasks. The problem was with the permission. If I
>> start all the slaves and Master as root it works fine.
>> Else I have problem with launching the tasks.
>>
>> But on one of the slave I could not launch the slave as root, I am facing
>> the following issue.
>>
>> Failed to create a containerizer: Could not create MesosContainerizer:
>> Failed to create launcher: Failed to create Linux launcher: Failed to mount
>> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
>> attached to another hierarchy
>>
>> I took that out from the cluster for now. The tasks are getting scheduled
>> on the other two slave nodes.
>>
>> Thanks for your timely help
>>
>> -Pradeep
>>
>> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> My steps was pretty simple just as
>>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>>
>>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>>
>>> Then schedule a task on any of the node, here I was using slave node
>>> mesos007, you can see that the two tasks was launched on different host.
>>>
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
>>> --resources="cpus(*):1;mem(*):256"
>>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>>> master@192.168.0.107:5050
>>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>>> Attempting to register without authentication
>>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>>> task cluster-test submitted to slave
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>> ^C
>>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
>>> --resources="cpus(*):1;mem(*):256"
>>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>>> master@192.168.0.107:5050
>>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>>> Attempting to register without authentication
>>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>>> task cluster-test submitted to slave
>>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>>> Received status update TASK_RUNNING for task cluster-test
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> I just want to know how did you launch the tasks.
>>>>
>>>> 1. What processes you have started on Master?
>>>> 2. What are the processes you have started on Slaves?
>>>>
>>>> I am missing something here, otherwise all my slave have enough memory
>>>> and cpus to launch the tasks I mentioned.
>>>> What I am missing is some configuration steps.
>>>>
>>>> Thanks & Regards,
>>>> Pradeep
>>>>
>>>>
>>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> I did some test with your case and found that the task can run
>>>>> randomly on the three slave hosts, every time may have different result.
>>>>> The logic is here:
>>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>>> The allocator will help random shuffle the slaves every time when
>>>>> allocate resources for offers.
>>>>>
>>>>> I see that each of your tasks needs at least
>>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of your
>>>>> slaves have enough resources? If you want your task to run on other
>>>>> slaves, those slaves need to have at least 3 cpus and 2560 MB of memory.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi Ondrej,
>>>>>>
>>>>>> Thanks for your reply
>>>>>>
>>>>>> I did solve that issue, yes you are right there was an issue with
>>>>>> slave IP address setting.
>>>>>>
>>>>>> Now I am facing issue with the scheduling the tasks. When I try to
>>>>>> schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>
>>>>>> I just start the mesos slaves like below
>>>>>>
>>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>>> --hostname=slave1
>>>>>>
>>>>>> If I submit the task using the above (mesos-execute) command from
>>>>>> same as one of the slave it runs on that system.
>>>>>>
>>>>>> But when I submit the task from some different system. It uses just
>>>>>> that system and queues the tasks not runs on the other slaves.
>>>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>>>
>>>>>> Do I need to start some process to push the task on all the slaves
>>>>>> equally? Am I missing something here?
>>>>>>
>>>>>> Regards,
>>>>>> Pradeep
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> the problem is with IP your slave advertise - mesos by default
>>>>>>> resolves your hostname - there are several solutions (let say your node ip
>>>>>>> is 192.168.56.128)
>>>>>>>
>>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>>> 2) set mesos options - ip, hostname
>>>>>>>
>>>>>>> one way to do this is to create files
>>>>>>>
>>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>>
>>>>>>> for more configuration options see
>>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>>
>>>>>>>> Hi Guangya,
>>>>>>>>
>>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>>
>>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>>> registered at the same address
>>>>>>>>
>>>>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>>>>> registered and de-registered to make a room for the next node. I can even
>>>>>>>> see this on
>>>>>>>> the UI interface, for some time one node got added and after some
>>>>>>>> time that will be replaced with the new slave node.
>>>>>>>>
>>>>>>>> The above log is followed by the below log messages.
>>>>>>>>
>>>>>>>>
>>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>>>>> bytes) to leveldb took 104089ns
>>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>>>>>> ports(*):[31000-32000]
>>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>>> (192.168.0.116) disconnected
>>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>>> (192.168.0.116)
>>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>>> (192.168.0.116)
>>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>>> learned notice for position 384
>>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>>>>> bytes) to leveldb took 95171ns
>>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>>>>> leveldb took 20333ns
>>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Pradeep,
>>>>>>>>>
>>>>>>>>> Please check some of my questions in line.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Guangya
>>>>>>>>>
>>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master
>>>>>>>>>> and 3 Slaves.
>>>>>>>>>>
>>>>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>>>>
>>>>>>>>>> I tried running the tasks by configuring one Node cluster. Tested
>>>>>>>>>> the task scheduling using mesos-execute, works fine.
>>>>>>>>>>
>>>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and
>>>>>>>>>> try to see the resources on the master (in GUI) only the Master node
>>>>>>>>>> resources are visible.
>>>>>>>>>> The other nodes resources are not visible. Some times visible
>>>>>>>>>> but in a de-actived state.
>>>>>>>>>>
>>>>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>>>>> There should be some logs in either master or slave telling you what is
>>>>>>>>> wrong.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Please let me know what could be the reason. All the nodes are
>>>>>>>>>> in the same network. *
>>>>>>>>>>
>>>>>>>>>> When I try to schedule a task using
>>>>>>>>>>
>>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>>
>>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>>
>>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>>>>> what is wrong with other three nodes first.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I*s it required to register the frameworks from every slave node
>>>>>>>>>> on the Master?*
>>>>>>>>>>
>>>>>>>>> It is not required.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Pradeep
>>>>>>>>>>
>>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
Glad it finally works! I am not sure whether you are using systemd.slice,
but are you running into this issue: https://issues.apache.org/jira/browse/MESOS-1195
Hope Jie Yu can give you some help on this ;-)
Thanks,
Guangya
On Mon, Oct 5, 2015 at 5:25 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Guangya,
>
>
> Thanks for sharing the information.
>
> Now I could launch the tasks. The problem was with the permission. If I
> start all the slaves and Master as root it works fine.
> Else I have problem with launching the tasks.
>
> But on one of the slave I could not launch the slave as root, I am facing
> the following issue.
>
> Failed to create a containerizer: Could not create MesosContainerizer:
> Failed to create launcher: Failed to create Linux launcher: Failed to mount
> cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
> attached to another hierarchy
>
> I took that out from the cluster for now. The tasks are getting scheduled
> on the other two slave nodes.
>
> Thanks for your timely help
>
> -Pradeep
>
> On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> My steps was pretty simple just as
>> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>>
>> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
>> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
>> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
>> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>>
>> Then schedule a task on any of the node, here I was using slave node
>> mesos007, you can see that the two tasks was launched on different host.
>>
>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
>> --resources="cpus(*):1;mem(*):256"
>> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
>> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
>> master@192.168.0.107:5050
>> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
>> Attempting to register without authentication
>> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
>> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
>> task cluster-test submitted to slave
>> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
>> Received status update TASK_RUNNING for task cluster-test
>> ^C
>> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
>> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
>> --resources="cpus(*):1;mem(*):256"
>> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
>> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
>> master@192.168.0.107:5050
>> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
>> Attempting to register without authentication
>> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
>> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
>> task cluster-test submitted to slave
>> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
>> Received status update TASK_RUNNING for task cluster-test
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Guangya,
>>>
>>> Thanks for your reply.
>>>
>>> I just want to know how did you launch the tasks.
>>>
>>> 1. What processes you have started on Master?
>>> 2. What are the processes you have started on Slaves?
>>>
>>> I am missing something here, otherwise all my slave have enough memory
>>> and cpus to launch the tasks I mentioned.
>>> What I am missing is some configuration steps.
>>>
>>> Thanks & Regards,
>>> Pradeep
>>>
>>>
>>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> I did some test with your case and found that the task can run randomly
>>>> on the three slave hosts, every time may have different result. The logic
>>>> is here:
>>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>>> The allocator will help random shuffle the slaves every time when
>>>> allocate resources for offers.
>>>>
>>>> I see that each of your tasks needs at least
>>>> resources="cpus(*):3;mem(*):2560". Can you check whether all of your
>>>> slaves have enough resources? If you want your task to run on other
>>>> slaves, those slaves need to have at least 3 cpus and 2560 MB of memory.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi Ondrej,
>>>>>
>>>>> Thanks for your reply
>>>>>
>>>>> I did solve that issue, yes you are right there was an issue with
>>>>> slave IP address setting.
>>>>>
>>>>> Now I am facing issue with the scheduling the tasks. When I try to
>>>>> schedule a task using
>>>>>
>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>
>>>>> The tasks always get scheduled on the same node. The resources from
>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>
>>>>> I just start the mesos slaves like below
>>>>>
>>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>>> --hostname=slave1
>>>>>
>>>>> If I submit the task using the above (mesos-execute) command from same
>>>>> as one of the slave it runs on that system.
>>>>>
>>>>> But when I submit the task from some different system. It uses just
>>>>> that system and queues the tasks not runs on the other slaves.
>>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>>
>>>>> Do I need to start some process to push the task on all the slaves
>>>>> equally? Am I missing something here?
>>>>>
>>>>> Regards,
>>>>> Pradeep
>>>>>
>>>>>
>>>>>
>>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> the problem is with IP your slave advertise - mesos by default
>>>>>> resolves your hostname - there are several solutions (let say your node ip
>>>>>> is 192.168.56.128)
>>>>>>
>>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>>> 2) set mesos options - ip, hostname
>>>>>>
>>>>>> one way to do this is to create files
>>>>>>
>>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>>
>>>>>> for more configuration options see
>>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com>:
>>>>>>
>>>>>>> Hi Guangya,
>>>>>>>
>>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>>
>>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>>> registered at the same address
>>>>>>>
>>>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>>>> registered and de-registered to make a room for the next node. I can even
>>>>>>> see this on
>>>>>>> the UI interface, for some time one node got added and after some
>>>>>>> time that will be replaced with the new slave node.
>>>>>>>
>>>>>>> The above log is followed by the below log messages.
>>>>>>>
>>>>>>>
>>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>>>> bytes) to leveldb took 104089ns
>>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>>>>> ports(*):[31000-32000]
>>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>> (192.168.0.116) disconnected
>>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>> (192.168.0.116)
>>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>>> (192.168.0.116)
>>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received
>>>>>>> learned notice for position 384
>>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>>>> bytes) to leveldb took 95171ns
>>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>>>> leveldb took 20333ns
>>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pradeep
>>>>>>>
>>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Pradeep,
>>>>>>>>
>>>>>>>> Please check some of my questions in line.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Guangya
>>>>>>>>
>>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and
>>>>>>>>> 3 Slaves.
>>>>>>>>>
>>>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>>>
>>>>>>>>> I tried running the tasks by configuring one Node cluster. Tested
>>>>>>>>> the task scheduling using mesos-execute, works fine.
>>>>>>>>>
>>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and try
>>>>>>>>> to see the resources on the master (in GUI) only the Master node resources
>>>>>>>>> are visible.
>>>>>>>>> The other nodes resources are not visible. Some times visible but
>>>>>>>>> in a de-actived state.
>>>>>>>>>
>>>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>>>> There should be some logs in either master or slave telling you what is
>>>>>>>> wrong.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Please let me know what could be the reason. All the nodes are in
>>>>>>>>> the same network. *
>>>>>>>>>
>>>>>>>>> When I try to schedule a task using
>>>>>>>>>
>>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>>
>>>>>>>>> The tasks always get scheduled on the same node. The resources
>>>>>>>>> from the other nodes are not getting used to schedule the tasks.
>>>>>>>>>
>>>>>>>> Based on your previous question, there is only one node in your
>>>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>>>> what is wrong with other three nodes first.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I*s it required to register the frameworks from every slave node
>>>>>>>>> on the Master?*
>>>>>>>>>
>>>>>>>> It is not required.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Pradeep
>>>>>>>>>
>>>>>>>>>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
Thanks for sharing the information.
Now I can launch the tasks. The problem was with permissions: if I start
all the slaves and the Master as root, it works fine. Otherwise I have
problems launching the tasks.
But on one of the slaves I could not start the slave process as root. I am
facing the following issue:
Failed to create a containerizer: Could not create MesosContainerizer:
Failed to create launcher: Failed to create Linux launcher: Failed to mount
cgroups hierarchy at '/sys/fs/cgroup/freezer': 'freezer' is already
attached to another hierarchy
I took that node out of the cluster for now. The tasks are getting scheduled
on the other two slave nodes.
Thanks for your timely help.
-Pradeep
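A note on the error above: it says the 'freezer' cgroup subsystem is already
attached to another hierarchy on that slave. One way to see where it is
currently mounted is to look at /proc/mounts. The Python sketch below assumes
a Linux host with cgroup v1 mounts; the helper name is hypothetical and the
sample data is made up for illustration.

```python
def find_cgroup_mounts(mounts_text, subsystem="freezer"):
    """Return mount points of cgroup (v1) filesystems that carry the subsystem.

    mounts_text is the content of /proc/mounts, where each line has the form:
    <device> <mount point> <fs type> <options> <dump> <pass>
    """
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Only cgroup v1 mounts list their subsystems in the options field.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        if subsystem in fields[3].split(","):
            mount_points.append(fields[1])
    return mount_points


if __name__ == "__main__":
    # Illustrative sample; on a real slave, pass open("/proc/mounts").read().
    sample = (
        "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
        "cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,cpuacct 0 0\n"
        "cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0\n"
    )
    for point in find_cgroup_mounts(sample):
        print("freezer is mounted at", point)
```

If the mount point reported on the failing slave differs from
/sys/fs/cgroup/freezer, that would be consistent with the error message,
though other causes are possible.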
On 5 October 2015 at 10:54, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> My steps was pretty simple just as
> https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
>
> On Master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
> ./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
> On 3 Slave node: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
> ./bin/mesos-slave.sh --master=192.168.0.107:5050
>
> Then schedule a task on any of the node, here I was using slave node
> mesos007, you can see that the two tasks was launched on different host.
>
> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
> --resources="cpus(*):1;mem(*):256"
> I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
> I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
> master@192.168.0.107:5050
> I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
> Attempting to register without authentication
> I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
> c0e5fdde-595e-4768-9d04-25901d4523b6-0002
> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
> task cluster-test submitted to slave
> c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
> Received status update TASK_RUNNING for task cluster-test
> ^C
> root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
> 192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
> --resources="cpus(*):1;mem(*):256"
> I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
> I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
> master@192.168.0.107:5050
> I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
> Attempting to register without authentication
> I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
> c0e5fdde-595e-4768-9d04-25901d4523b6-0003
> Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
> task cluster-test submitted to slave
> c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
> Received status update TASK_RUNNING for task cluster-test
>
> Thanks,
>
> Guangya
>
> On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Guangya,
>>
>> Thanks for your reply.
>>
>> I just want to know how did you launch the tasks.
>>
>> 1. What processes you have started on Master?
>> 2. What are the processes you have started on Slaves?
>>
>> I am missing something here, otherwise all my slave have enough memory
>> and cpus to launch the tasks I mentioned.
>> What I am missing is some configuration steps.
>>
>> Thanks & Regards,
>> Pradeep
>>
>>
>> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> I did some test with your case and found that the task can run randomly
>>> on the three slave hosts, every time may have different result. The logic
>>> is here:
>>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>>> The allocator will help random shuffle the slaves every time when
>>> allocate resources for offers.
>>>
>>> I see that every of your task need the minimum resources as "
>>> resources="cpus(*):3;mem(*):2560", can you help check if all of your
>>> slaves have enough resources? If you want your task run on other slaves,
>>> then those slaves need to have at least 3 cpus and 2560M memory.
>>>
>>> Thanks
>>>
>>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>>> pradeepkiruvale@gmail.com> wrote:
>>>
>>>> Hi Ondrej,
>>>>
>>>> Thanks for your reply
>>>>
>>>> I did solve that issue, yes you are right there was an issue with slave
>>>> IP address setting.
>>>>
>>>> Now I am facing issue with the scheduling the tasks. When I try to
>>>> schedule a task using
>>>>
>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>> --resources="cpus(*):3;mem(*):2560"
>>>>
>>>> The tasks always get scheduled on the same node. The resources from the
>>>> other nodes are not getting used to schedule the tasks.
>>>>
>>>> I just start the mesos slaves like below
>>>>
>>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>>> --hostname=slave1
>>>>
>>>> If I submit the task using the above (mesos-execute) command from same
>>>> as one of the slave it runs on that system.
>>>>
>>>> But when I submit the task from some different system. It uses just
>>>> that system and queues the tasks not runs on the other slaves.
>>>> Some times I see the message "Failed to getgid: unknown user"
>>>>
>>>> Do I need to start some process to push the task on all the slaves
>>>> equally? Am I missing something here?
>>>>
>>>> Regards,
>>>> Pradeep
>>>>
>>>>
>>>>
>>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> the problem is with IP your slave advertise - mesos by default
>>>>> resolves your hostname - there are several solutions (let say your node ip
>>>>> is 192.168.56.128)
>>>>>
>>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>>> 2) set mesos options - ip, hostname
>>>>>
>>>>> one way to do this is to create files
>>>>>
>>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>>
>>>>> for more configuration options see
>>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pradeepkiruvale@gmail.com
>>>>> >:
>>>>>
>>>>>> Hi Guangya,
>>>>>>
>>>>>> Thanks for reply. I found one interesting log message.
>>>>>>
>>>>>> 7410 master.cpp:5977] Removed slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>>> registered at the same address
>>>>>>
>>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>>> registered and de-registered to make a room for the next node. I can even
>>>>>> see this on
>>>>>> the UI interface, for some time one node got added and after some
>>>>>> time that will be replaced with the new slave node.
>>>>>>
>>>>>> The above log is followed by the below log messages.
>>>>>>
>>>>>>
>>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>>> bytes) to leveldb took 104089ns
>>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 15: Transport endpoint is not connected
>>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>>>> ports(*):[31000-32000]
>>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>> (192.168.0.116) disconnected
>>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>> (192.168.0.116)
>>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>>> socket with fd 16: Transport endpoint is not connected
>>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>>> (192.168.0.116)
>>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>>>> notice for position 384
>>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>>> bytes) to leveldb took 95171ns
>>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>>> leveldb took 20333ns
>>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Pradeep
>>>>>>
>>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pradeep,
>>>>>>>
>>>>>>> Please check some of my questions in line.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Guangya
>>>>>>>
>>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and
>>>>>>>> 3 Slaves.
>>>>>>>>
>>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>>
>>>>>>>> I tried running the tasks by configuring one Node cluster. Tested
>>>>>>>> the task scheduling using mesos-execute, works fine.
>>>>>>>>
>>>>>>>> When I configure three Node cluster (1master and 3 slaves) and try
>>>>>>>> to see the resources on the master (in GUI) only the Master node resources
>>>>>>>> are visible.
>>>>>>>> The other nodes resources are not visible. Some times visible but
>>>>>>>> in a de-actived state.
>>>>>>>>
>>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>>> There should be some logs in either master or slave telling you what is
>>>>>>> wrong.
>>>>>>>
>>>>>>>>
>>>>>>>> *Please let me know what could be the reason. All the nodes are in
>>>>>>>> the same network. *
>>>>>>>>
>>>>>>>> When I try to schedule a task using
>>>>>>>>
>>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>>
>>>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>>
>>>>>>> Based on your previous question, there is only one node in your
>>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>>> what is wrong with other three nodes first.
>>>>>>>
>>>>>>>>
>>>>>>>> I*s it required to register the frameworks from every slave node
>>>>>>>> on the Master?*
>>>>>>>>
>>>>>>> It is not required.
>>>>>>>
>>>>>>>>
>>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks & Regards,
>>>>>>>> Pradeep
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
My steps were pretty simple, following
https://github.com/apache/mesos/blob/master/docs/getting-started.md#examples
On the master node: root@mesos1:~/src/mesos/m1/mesos/build# GLOG_v=1
./bin/mesos-master.sh --ip=192.168.0.107 --work_dir=/var/lib/mesos
On each of the 3 slave nodes: root@mesos007:~/src/mesos/m1/mesos/build# GLOG_v=1
./bin/mesos-slave.sh --master=192.168.0.107:5050
Then schedule a task from any of the nodes. Here I was using slave node
mesos007, and you can see (on the lines marked <<<) that the two tasks were
launched on different hosts.
root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
--resources="cpus(*):1;mem(*):256"
I1005 16:49:11.013432 2971 sched.cpp:164] Version: 0.26.0
I1005 16:49:11.027802 2992 sched.cpp:262] New master detected at
master@192.168.0.107:5050
I1005 16:49:11.029579 2992 sched.cpp:272] No credentials provided.
Attempting to register without authentication
I1005 16:49:11.038182 2985 sched.cpp:641] Framework registered with
c0e5fdde-595e-4768-9d04-25901d4523b6-0002
Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0002
task cluster-test submitted to slave
c0e5fdde-595e-4768-9d04-25901d4523b6-S0 <<<<<<<<<<<<<<<<<<
Received status update TASK_RUNNING for task cluster-test
^C
root@mesos007:~/src/mesos/m1/mesos/build# ./src/mesos-execute --master=
192.168.0.107:5050 --name="cluster-test" --command="/bin/sleep 100"
--resources="cpus(*):1;mem(*):256"
I1005 16:50:18.346984 3036 sched.cpp:164] Version: 0.26.0
I1005 16:50:18.366114 3055 sched.cpp:262] New master detected at
master@192.168.0.107:5050
I1005 16:50:18.368010 3055 sched.cpp:272] No credentials provided.
Attempting to register without authentication
I1005 16:50:18.376338 3056 sched.cpp:641] Framework registered with
c0e5fdde-595e-4768-9d04-25901d4523b6-0003
Framework registered with c0e5fdde-595e-4768-9d04-25901d4523b6-0003
task cluster-test submitted to slave
c0e5fdde-595e-4768-9d04-25901d4523b6-S1 <<<<<<<<<<<<<<<<<<<<
Received status update TASK_RUNNING for task cluster-test
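To compare placements across several runs without reading the logs by eye, the slave ID can be pulled out of the mesos-execute output. The following is only a rough sketch: the function name `submitted_slaves` is made up for illustration, and the pattern assumes the "task ... submitted to slave ..." wording shown in the output above, which may differ in other Mesos versions.

```python
import re

# Matches "task <name> submitted to slave <id>", even when the ID has been
# wrapped onto the next line as in the mail above.
SUBMITTED = re.compile(r"task\s+(\S+)\s+submitted to slave\s+(\S+)")

def submitted_slaves(output):
    """Return (task_name, slave_id) pairs found in mesos-execute output."""
    return SUBMITTED.findall(output)

first_run = (
    "task cluster-test submitted to slave\n"
    "c0e5fdde-595e-4768-9d04-25901d4523b6-S0\n"
    "Received status update TASK_RUNNING for task cluster-test\n"
)
second_run = (
    "task cluster-test submitted to slave\n"
    "c0e5fdde-595e-4768-9d04-25901d4523b6-S1\n"
    "Received status update TASK_RUNNING for task cluster-test\n"
)

for run in (first_run, second_run):
    for task, slave in submitted_slaves(run):
        print(task, "->", slave)
```

If the same slave ID shows up on every run, that points at the other slaves not offering enough resources rather than at the submission command.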
Thanks,
Guangya
On Mon, Oct 5, 2015 at 4:21 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Guangya,
>
> Thanks for your reply.
>
> I just want to know how did you launch the tasks.
>
> 1. What processes you have started on Master?
> 2. What are the processes you have started on Slaves?
>
> I am missing something here, otherwise all my slave have enough memory and
> cpus to launch the tasks I mentioned.
> What I am missing is some configuration steps.
>
> Thanks & Regards,
> Pradeep
>
>
> On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> I did some test with your case and found that the task can run randomly
>> on the three slave hosts, every time may have different result. The logic
>> is here:
>> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
>> The allocator will help random shuffle the slaves every time when
>> allocate resources for offers.
>>
>> I see that every of your task need the minimum resources as "
>> resources="cpus(*):3;mem(*):2560", can you help check if all of your
>> slaves have enough resources? If you want your task run on other slaves,
>> then those slaves need to have at least 3 cpus and 2560M memory.
>>
>> Thanks
>>
>> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
>> pradeepkiruvale@gmail.com> wrote:
>>
>>> Hi Ondrej,
>>>
>>> Thanks for your reply
>>>
>>> I did solve that issue, yes you are right there was an issue with slave
>>> IP address setting.
>>>
>>> Now I am facing issue with the scheduling the tasks. When I try to
>>> schedule a task using
>>>
>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>> --resources="cpus(*):3;mem(*):2560"
>>>
>>> The tasks always get scheduled on the same node. The resources from the
>>> other nodes are not getting used to schedule the tasks.
>>>
>>> I just start the mesos slaves like below
>>>
>>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos
>>> --hostname=slave1
>>>
>>> If I submit the task using the above (mesos-execute) command from same
>>> as one of the slave it runs on that system.
>>>
>>> But when I submit the task from some different system. It uses just that
>>> system and queues the tasks not runs on the other slaves.
>>> Some times I see the message "Failed to getgid: unknown user"
>>>
>>> Do I need to start some process to push the task on all the slaves
>>> equally? Am I missing something here?
>>>
>>> Regards,
>>> Pradeep
>>>
>>>
>>>
>>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> the problem is with IP your slave advertise - mesos by default resolves
>>>> your hostname - there are several solutions (let say your node ip is
>>>> 192.168.56.128)
>>>>
>>>> 1) export LIBPROCESS_IP=192.168.56.128
>>>> 2) set mesos options - ip, hostname
>>>>
>>>> one way to do this is to create files
>>>>
>>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>>
>>>> for more configuration options see
>>>> http://mesos.apache.org/documentation/latest/configuration
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>
>>>> :
>>>>
>>>>> Hi Guangya,
>>>>>
>>>>> Thanks for reply. I found one interesting log message.
>>>>>
>>>>> 7410 master.cpp:5977] Removed slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>>> registered at the same address
>>>>>
>>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>>> registered and de-registered to make a room for the next node. I can even
>>>>> see this on
>>>>> the UI interface, for some time one node got added and after some time
>>>>> that will be replaced with the new slave node.
>>>>>
>>>>> The above log is followed by the below log messages.
>>>>>
>>>>>
>>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>>> bytes) to leveldb took 104089ns
>>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 15: Transport endpoint is not connected
>>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>>> ports(*):[31000-32000]
>>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116) disconnected
>>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116)
>>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown
>>>>> socket with fd 16: Transport endpoint is not connected
>>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>>> (192.168.0.116)
>>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>>> notice for position 384
>>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>>> bytes) to leveldb took 95171ns
>>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>>> leveldb took 20333ns
>>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Pradeep
>>>>>
>>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pradeep,
>>>>>>
>>>>>> Please check some of my questions in line.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Guangya
>>>>>>
>>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3
>>>>>>> Slaves.
>>>>>>>
>>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>>> different nodes. Here node means the physical boxes.
>>>>>>>
>>>>>>> I tried running the tasks by configuring one Node cluster. Tested
>>>>>>> the task scheduling using mesos-execute, works fine.
>>>>>>>
>>>>>>> When I configure three Node cluster (1master and 3 slaves) and try
>>>>>>> to see the resources on the master (in GUI) only the Master node resources
>>>>>>> are visible.
>>>>>>> The other nodes resources are not visible. Some times visible but
>>>>>>> in a de-actived state.
>>>>>>>
>>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>>> There should be some logs in either master or slave telling you what is
>>>>>> wrong.
>>>>>>
>>>>>>>
>>>>>>> *Please let me know what could be the reason. All the nodes are in
>>>>>>> the same network. *
>>>>>>>
>>>>>>> When I try to schedule a task using
>>>>>>>
>>>>>>> /src/mesos-execute --master=192.168.0.102:5050
>>>>>>> --name="cluster-test" --command="/usr/bin/hackbench -s 4096 -l 10845760 -g
>>>>>>> 2 -f 2 -P" --resources="cpus(*):3;mem(*):2560"
>>>>>>>
>>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>>
>>>>>> Based on your previous question, there is only one node in your
>>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>>> what is wrong with other three nodes first.
>>>>>>
>>>>>>>
>>>>>>> I*s it required to register the frameworks from every slave node on
>>>>>>> the Master?*
>>>>>>>
>>>>>> It is not required.
>>>>>>
>>>>>>>
>>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>>
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Pradeep
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
Thanks for your reply.
I just want to know how you launched the tasks.
1. Which processes did you start on the master?
2. Which processes did you start on the slaves?
I must be missing something here, because all my slaves have enough memory
and CPUs to launch the tasks I mentioned.
What I am missing is probably some configuration step.
Thanks & Regards,
Pradeep
On 3 October 2015 at 13:14, Guangya Liu <gy...@gmail.com> wrote:
> Hi Pradeep,
>
> I did some test with your case and found that the task can run randomly on
> the three slave hosts, every time may have different result. The logic is
> here:
> https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
> The allocator will help random shuffle the slaves every time when
> allocate resources for offers.
>
> I see that every of your task need the minimum resources as "
> resources="cpus(*):3;mem(*):2560", can you help check if all of your
> slaves have enough resources? If you want your task run on other slaves,
> then those slaves need to have at least 3 cpus and 2560M memory.
>
> Thanks
>
> On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <
> pradeepkiruvale@gmail.com> wrote:
>
>> Hi Ondrej,
>>
>> Thanks for your reply
>>
>> I did solve that issue, yes you are right there was an issue with slave
>> IP address setting.
>>
>> Now I am facing issue with the scheduling the tasks. When I try to
>> schedule a task using
>>
>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>> --resources="cpus(*):3;mem(*):2560"
>>
>> The tasks always get scheduled on the same node. The resources from the
>> other nodes are not getting used to schedule the tasks.
>>
>> I just start the mesos slaves like below
>>
>> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>>
>> If I submit the task using the above (mesos-execute) command from same as
>> one of the slave it runs on that system.
>>
>> But when I submit the task from some different system. It uses just that
>> system and queues the tasks not runs on the other slaves.
>> Some times I see the message "Failed to getgid: unknown user"
>>
>> Do I need to start some process to push the task on all the slaves
>> equally? Am I missing something here?
>>
>> Regards,
>> Pradeep
>>
>>
>>
>> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com> wrote:
>>
>>> Hi Pradeep,
>>>
>>> the problem is with IP your slave advertise - mesos by default resolves
>>> your hostname - there are several solutions (let say your node ip is
>>> 192.168.56.128)
>>>
>>> 1) export LIBPROCESS_IP=192.168.56.128
>>> 2) set mesos options - ip, hostname
>>>
>>> one way to do this is to create files
>>>
>>> echo "192.168.56.128" > /etc/mesos-slave/ip
>>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>>
>>> for more configuration options see
>>> http://mesos.apache.org/documentation/latest/configuration
>>>
>>>
>>>
>>>
>>>
>>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>:
>>>
>>>> Hi Guangya,
>>>>
>>>> Thanks for reply. I found one interesting log message.
>>>>
>>>> 7410 master.cpp:5977] Removed slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>>> registered at the same address
>>>>
>>>> Mostly because of this issue, the systems/slave nodes are getting
>>>> registered and de-registered to make a room for the next node. I can even
>>>> see this on
>>>> the UI interface, for some time one node got added and after some time
>>>> that will be replaced with the new slave node.
>>>>
>>>> The above log is followed by the below log messages.
>>>>
>>>>
>>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>>> bytes) to leveldb took 104089ns
>>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 15: Transport endpoint is not connected
>>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>>> ports(*):[31000-32000]
>>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116) disconnected
>>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket
>>>> with fd 16: Transport endpoint is not connected
>>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>>> (192.168.0.116)
>>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>>> notice for position 384
>>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>>> bytes) to leveldb took 95171ns
>>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>>> leveldb took 20333ns
>>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>>
>>>>
>>>> Thanks,
>>>> Pradeep
>>>>
>>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>>
>>>>> Hi Pradeep,
>>>>>
>>>>> Please check some of my questions in line.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Guangya
>>>>>
>>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>>> pradeepkiruvale@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3
>>>>>> Slaves.
>>>>>>
>>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>>> different nodes. Here node means the physical boxes.
>>>>>>
>>>>>> I tried running the tasks by configuring one Node cluster. Tested the
>>>>>> task scheduling using mesos-execute, works fine.
>>>>>>
>>>>>> When I configure three Node cluster (1master and 3 slaves) and try to
>>>>>> see the resources on the master (in GUI) only the Master node resources are
>>>>>> visible.
>>>>>> The other nodes resources are not visible. Some times visible but in
>>>>>> a de-actived state.
>>>>>>
>>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>>> There should be some logs in either master or slave telling you what is
>>>>> wrong.
>>>>>
>>>>>>
>>>>>> *Please let me know what could be the reason. All the nodes are in
>>>>>> the same network. *
>>>>>>
>>>>>> When I try to schedule a task using
>>>>>>
>>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>>
>>>>>> The tasks always get scheduled on the same node. The resources from
>>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>>
>>>>> Based on your previous question, there is only one node in your
>>>>> cluster, that's why other nodes are not available. We need first identify
>>>>> what is wrong with other three nodes first.
>>>>>
>>>>>>
>>>>>> I*s it required to register the frameworks from every slave node on
>>>>>> the Master?*
>>>>>>
>>>>> It is not required.
>>>>>
>>>>>>
>>>>>> *I have configured this cluster using the git-hub code.*
>>>>>>
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Pradeep
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
I ran some tests with your case and found that the task can land on any of
the three slave hosts; each run may give a different result. The logic is
here:
https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.hpp#L1263-#L1266
The allocator randomly shuffles the slaves every time it allocates
resources for offers.
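The effect of that shuffle can be illustrated with a small simulation. This is a simplified sketch and not the Mesos allocator itself (which is C++ and hands out resource offers to frameworks rather than placing tasks directly). The function name `pick_agent`, the slave IDs and the capacities are all made up for illustration.

```python
import random

def pick_agent(agents, needed, rng=random):
    """Return the first agent, in shuffled order, that can fit `needed`.

    `agents` maps a hypothetical agent ID to its free resources.
    Returns None if no agent is large enough.
    """
    order = list(agents)
    rng.shuffle(order)  # a fresh random order on every allocation cycle
    for agent_id in order:
        free = agents[agent_id]
        if all(free.get(name, 0) >= amount for name, amount in needed.items()):
            return agent_id
    return None

# Hypothetical cluster: two large slaves and one that is too small.
agents = {
    "S0": {"cpus": 8, "mem": 14930},
    "S1": {"cpus": 8, "mem": 14930},
    "S2": {"cpus": 2, "mem": 1024},
}
needed = {"cpus": 3, "mem": 2560}

# Over many runs the task lands on different slaves, but never on S2.
placements = {pick_agent(agents, needed) for _ in range(200)}
print(sorted(placements))
```

The point of the sketch is that a task which always lands on the same slave suggests that only that slave passes the resource check, since the visiting order changes between runs.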
I see that each of your tasks needs at least
resources="cpus(*):3;mem(*):2560". Can you check whether all of your slaves
have enough resources? If you want your task to run on the other slaves,
those slaves need to have at least 3 cpus and 2560 MB of memory.
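One way to do that check is to compare the --resources string against the resources a slave reports when it registers (they appear in the master log). The sketch below is only an illustration: `parse_resources` and `fits` are hypothetical helper names, only scalar resources are handled, and the slave string is the one from the master log quoted earlier in this thread.

```python
def parse_resources(text):
    """Parse scalar resources such as 'cpus(*):3;mem(*):2560' into a dict.

    Non-scalar entries (for example 'ports(*):[31000-32000]') are skipped.
    """
    result = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        name_role, _, value = item.partition(":")
        name = name_role.split("(")[0].strip()  # drop the "(*)" role suffix
        try:
            result[name] = float(value)
        except ValueError:
            continue  # ranges and sets are ignored in this sketch
    return result

def fits(requested, offered):
    """True if every requested scalar resource is available in `offered`."""
    return all(offered.get(name, 0.0) >= amount
               for name, amount in requested.items())

request = parse_resources("cpus(*):3;mem(*):2560")
slave = parse_resources(
    "cpus(*):8; mem(*):14930; disk(*):218578; ports(*):[31000-32000]"
)
print(fits(request, slave))
```

Note that what counts is the resources still free on the slave, so a slave that is already running another 3-cpu task may no longer fit a second one.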
Thanks
On Fri, Oct 2, 2015 at 9:26 PM, Pradeep Kiruvale <pr...@gmail.com>
wrote:
> Hi Ondrej,
>
> Thanks for your reply
>
> I did solve that issue, yes you are right there was an issue with slave IP
> address setting.
>
> Now I am facing issue with the scheduling the tasks. When I try to
> schedule a task using
>
> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
> --resources="cpus(*):3;mem(*):2560"
>
> The tasks always get scheduled on the same node. The resources from the
> other nodes are not getting used to schedule the tasks.
>
> I just start the mesos slaves like below
>
> ./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
>
> If I submit the task using the above (mesos-execute) command from same as
> one of the slave it runs on that system.
>
> But when I submit the task from some different system. It uses just that
> system and queues the tasks not runs on the other slaves.
> Some times I see the message "Failed to getgid: unknown user"
>
> Do I need to start some process to push the task on all the slaves
> equally? Am I missing something here?
>
> Regards,
> Pradeep
>
>
>
> On 2 October 2015 at 15:07, Ondrej Smola <on...@gmail.com> wrote:
>
>> Hi Pradeep,
>>
>> the problem is with IP your slave advertise - mesos by default resolves
>> your hostname - there are several solutions (let say your node ip is
>> 192.168.56.128)
>>
>> 1) export LIBPROCESS_IP=192.168.56.128
>> 2) set mesos options - ip, hostname
>>
>> one way to do this is to create files
>>
>> echo "192.168.56.128" > /etc/mesos-slave/ip
>> echo "abc.mesos.com" > /etc/mesos-slave/hostname
>>
>> for more configuration options see
>> http://mesos.apache.org/documentation/latest/configuration
>>
>>
>>
>>
>>
>> 2015-10-02 10:06 GMT+02:00 Pradeep Kiruvale <pr...@gmail.com>:
>>
>>> Hi Guangya,
>>>
>>> Thanks for reply. I found one interesting log message.
>>>
>>> 7410 master.cpp:5977] Removed slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
>>> registered at the same address
>>>
>>> Mostly because of this issue, the systems/slave nodes are getting
>>> registered and de-registered to make a room for the next node. I can even
>>> see this on
>>> the UI interface, for some time one node got added and after some time
>>> that will be replaced with the new slave node.
>>>
>>> The above log is followed by the below log messages.
>>>
>>>
>>> I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18
>>> bytes) to leveldb took 104089ns
>>> I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
>>> E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 15: Transport endpoint is not connected
>>> I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
>>> ports(*):[31000-32000]
>>> I1002 10:01:12.754065 7413 master.cpp:1080] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116) disconnected
>>> I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
>>> mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
>>> I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket
>>> with fd 16: Transport endpoint is not connected
>>> I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
>>> (192.168.0.116)
>>> I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
>>> 6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
>>> I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
>>> notice for position 384
>>> I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20
>>> bytes) to leveldb took 95171ns
>>> I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from
>>> leveldb took 20333ns
>>> I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
>>>
>>>
>>> Thanks,
>>> Pradeep
>>>
>>> On 2 October 2015 at 02:35, Guangya Liu <gy...@gmail.com> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> Please check some of my questions in line.
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>> On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <
>>>> pradeepkiruvale@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3
>>>>> Slaves.
>>>>>
>>>>> One slave runs on the Master Node itself and Other slaves run on
>>>>> different nodes. Here node means the physical boxes.
>>>>>
>>>>> I tried running the tasks by configuring one Node cluster. Tested the
>>>>> task scheduling using mesos-execute, works fine.
>>>>>
>>>>> When I configure three Node cluster (1master and 3 slaves) and try to
>>>>> see the resources on the master (in GUI) only the Master node resources are
>>>>> visible.
>>>>> The other nodes resources are not visible. Some times visible but in
>>>>> a de-actived state.
>>>>>
>>>> Can you please append some logs from mesos-slave and mesos-master?
>>>> There should be some logs in either master or slave telling you what is
>>>> wrong.
>>>>
>>>>>
>>>>> *Please let me know what could be the reason. All the nodes are in the
>>>>> same network. *
>>>>>
>>>>> When I try to schedule a task using
>>>>>
>>>>> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
>>>>> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
>>>>> --resources="cpus(*):3;mem(*):2560"
>>>>>
>>>>> The tasks always get scheduled on the same node. The resources from
>>>>> the other nodes are not getting used to schedule the tasks.
>>>>>
>>>> Based on your previous question, there is only one node in your
>>>> cluster, that's why other nodes are not available. We need first identify
>>>> what is wrong with other three nodes first.
>>>>
>>>>>
>>>>> I*s it required to register the frameworks from every slave node on
>>>>> the Master?*
>>>>>
>>>> It is not required.
>>>>
>>>>>
>>>>> *I have configured this cluster using the git-hub code.*
>>>>>
>>>>>
>>>>> Thanks & Regards,
>>>>> Pradeep
>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Ondrej,
Thanks for your reply.
I solved that issue. You were right: the slave IP address setting was
wrong.
Now I am facing an issue with scheduling tasks. When I try to schedule
a task using
/src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
--command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
--resources="cpus(*):3;mem(*):2560"
The tasks always get scheduled on the same node. The resources from the
other nodes are not used to schedule the tasks.
I start the Mesos slaves like this:
./bin/mesos-slave.sh --master=192.168.0.102:5050/mesos --hostname=slave1
If I submit the task with the above mesos-execute command from a machine
that is also one of the slaves, it runs on that machine.
But when I submit the task from a different system, it still uses just that
system and queues the tasks instead of running them on the other slaves.
Sometimes I see the message "Failed to getgid: unknown user".
Do I need to start some process to push the tasks to all the slaves equally?
Am I missing something here?
Regards,
Pradeep
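The "Failed to getgid: unknown user" message suggests that the account
submitting the task may not exist on the slave that tries to run it. A
minimal sketch for checking this on each slave, assuming the standard id
utility is available; the function name user_exists and the TASK_USER
variable are illustrative only and not part of Mesos:

```shell
#!/bin/sh
# Succeeds (exit status 0) when the named account exists on this machine.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# TASK_USER is a placeholder: the login name used on the machine that
# runs mesos-execute. It defaults to the current user (or root if the
# current user name cannot be determined).
TASK_USER="${TASK_USER:-$(id -un 2>/dev/null || echo root)}"

if user_exists "$TASK_USER"; then
    echo "user '$TASK_USER' exists on this host"
else
    echo "user '$TASK_USER' is missing on this host"
fi
```

Running it on every slave with TASK_USER set to the submitting login name
shows where the account is missing. If the accounts cannot be unified, the
slave's switch_user option described in the Mesos configuration
documentation may also be relevant.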
Re: Running a task in Mesos cluster
Posted by Ondrej Smola <on...@gmail.com>.
Hi Pradeep,
The problem is with the IP address your slave advertises. By default, Mesos
resolves your hostname to determine it. There are several solutions (let's
say your node IP is 192.168.56.128):
1) export LIBPROCESS_IP=192.168.56.128
2) set the Mesos options ip and hostname
One way to do this is to create the files:
echo "192.168.56.128" > /etc/mesos-slave/ip
echo "abc.mesos.com" > /etc/mesos-slave/hostname
For more configuration options, see
http://mesos.apache.org/documentation/latest/configuration
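A minimal sketch of option 2 for a slave started from a source build, as in
the ./bin/mesos-slave.sh command used elsewhere in this thread. It assumes
the --ip and --hostname flags listed on the configuration page; the
addresses are the example values from this thread, and slave_command is a
hypothetical helper that only prints the command. The /etc/mesos-slave
files appear to be read by packaged init scripts rather than by mesos-slave
itself, so with a source build the flags may need to be passed directly.

```shell
#!/bin/sh
# Placeholder values: replace them with the real values for each slave.
NODE_IP="192.168.56.128"        # address the slave should advertise
NODE_HOSTNAME="abc.mesos.com"   # hostname the slave should advertise
MASTER="192.168.0.102:5050"     # master address used in this thread

# Print the mesos-slave command line for the given settings.
# $1 = IP to advertise, $2 = hostname to advertise, $3 = master address
slave_command() {
    printf './bin/mesos-slave.sh --master=%s --ip=%s --hostname=%s\n' "$3" "$1" "$2"
}

# Print the command instead of running it, so it can be reviewed first.
slave_command "$NODE_IP" "$NODE_HOSTNAME" "$MASTER"
```

Each slave needs its own IP and hostname; two slaves that advertise the same
address would be treated by the master as the same slave.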
Re: Running a task in Mesos cluster
Posted by Pradeep Kiruvale <pr...@gmail.com>.
Hi Guangya,
Thanks for the reply. I found one interesting log message:
7410 master.cpp:5977] Removed slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S52 (192.168.0.178): a new slave
registered at the same address
Most likely because of this, the slave nodes keep getting registered and
de-registered to make room for the next node. I can even see this in the
web UI: one node is added, and after some time it is replaced by a new
slave node.
The above log is followed by these log messages:
I1002 10:01:12.753865 7416 leveldb.cpp:343] Persisting action (18 bytes)
to leveldb took 104089ns
I1002 10:01:12.753885 7416 replica.cpp:679] Persisted action at 384
E1002 10:01:12.753891 7417 process.cpp:1912] Failed to shutdown socket
with fd 15: Transport endpoint is not connected
I1002 10:01:12.753988 7413 master.cpp:3930] Registered slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
(192.168.0.116) with cpus(*):8; mem(*):14930; disk(*):218578;
ports(*):[31000-32000]
I1002 10:01:12.754065 7413 master.cpp:1080] Slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
(192.168.0.116) disconnected
I1002 10:01:12.754072 7416 hierarchical.hpp:675] Added slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 (192.168.0.116) with cpus(*):8;
mem(*):14930; disk(*):218578; ports(*):[31000-32000] (allocated: )
I1002 10:01:12.754084 7413 master.cpp:2534] Disconnecting slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
(192.168.0.116)
E1002 10:01:12.754118 7417 process.cpp:1912] Failed to shutdown socket
with fd 16: Transport endpoint is not connected
I1002 10:01:12.754132 7413 master.cpp:2553] Deactivating slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 at slave(1)@127.0.1.1:5051
(192.168.0.116)
I1002 10:01:12.754237 7416 hierarchical.hpp:768] Slave
6a11063e-b8ff-43bd-86cf-e6eef0de06fd-S62 deactivated
I1002 10:01:12.754240 7413 replica.cpp:658] Replica received learned
notice for position 384
I1002 10:01:12.754360 7413 leveldb.cpp:343] Persisting action (20 bytes)
to leveldb took 95171ns
I1002 10:01:12.754395 7413 leveldb.cpp:401] Deleting ~2 keys from leveldb
took 20333ns
I1002 10:01:12.754406 7413 replica.cpp:679] Persisted action at 384
Thanks,
Pradeep
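In the log above the slave registers as slave(1)@127.0.1.1:5051, a loopback
address, which suggests the slave's hostname resolves to 127.0.1.1 (a common
/etc/hosts default on Debian-style systems). A minimal sketch to check this
on each slave, assuming hostname, getent and awk are available; is_loopback
is an illustrative helper, not a Mesos tool:

```shell
#!/bin/sh
# Succeeds when the address is IPv4 loopback (127.0.0.0/8) or IPv6 ::1.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Resolve this machine's hostname through the system resolver.
name="$(hostname 2>/dev/null || true)"
addr="$(getent hosts "$name" 2>/dev/null | awk '{print $1; exit}')"

if [ -z "$addr" ]; then
    echo "could not resolve hostname '$name'"
elif is_loopback "$addr"; then
    echo "$name resolves to $addr (loopback): other machines cannot reach the slave there"
else
    echo "$name resolves to $addr"
fi
```

If several slaves all report a loopback address, the master would see them
at the same address, which would be consistent with the "a new slave
registered at the same address" message.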
Re: Running a task in Mesos cluster
Posted by Guangya Liu <gy...@gmail.com>.
Hi Pradeep,
Please check some of my questions in line.
Thanks,
Guangya
On Fri, Oct 2, 2015 at 12:55 AM, Pradeep Kiruvale <pradeepkiruvale@gmail.com> wrote:
> Hi All,
>
> I am new to Mesos. I have set up a Mesos cluster with 1 Master and 3
> Slaves.
>
> One slave runs on the Master Node itself and Other slaves run on different
> nodes. Here node means the physical boxes.
>
> I tried running the tasks by configuring one Node cluster. Tested the task
> scheduling using mesos-execute, works fine.
>
> When I configure a three-node cluster (1 master and 3 slaves) and try to
> see the resources on the master (in the GUI), only the Master node's
> resources are visible.
> The other nodes' resources are not visible. Sometimes they are visible,
> but in a deactivated state.
>
Can you please append some logs from mesos-slave and mesos-master? There
should be some logs in either master or slave telling you what is wrong.
>
> *Please let me know what could be the reason. All the nodes are in the
> same network. *
>
> When I try to schedule a task using
>
> /src/mesos-execute --master=192.168.0.102:5050 --name="cluster-test"
> --command="/usr/bin/hackbench -s 4096 -l 10845760 -g 2 -f 2 -P"
> --resources="cpus(*):3;mem(*):2560"
>
> The tasks always get scheduled on the same node. The resources from the
> other nodes are not getting used to schedule the tasks.
>
Based on your previous question, only one node is active in your cluster,
which is why the other nodes are not available. We first need to identify
what is wrong with the other nodes.
>
> *Is it required to register the frameworks from every slave node on the
> Master?*
>
It is not required.
>
> *I have configured this cluster using the git-hub code.*
>
>
> Thanks & Regards,
> Pradeep
>
>