Posted to dev@mesos.apache.org by Vinod Kone <vi...@apache.org> on 2016/09/01 07:12:16 UTC

Re: jobs are stuck in agents and staying in staged state

AFAICT, your agent is listening on port 8082 and not the default 5051.
---------
I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
128.226.116.69:8082
------------

The fact that the agent is receiving a task from the master means that the
firewall on the agent allows incoming connections to 8082. So I'm surprised
that a local connection from the executor to the agent is being denied.
What exactly are your firewall rules on the agent?

Also, can you share the stderr/stdout of an example executor?
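
For what it's worth, the listening port can be confirmed directly on the agent host. The sketch below is an editorial addition, not a command from the thread; it parses /proc/net/tcp so it works even where `ss` or `netstat` is not installed:

```shell
# List the TCP ports this host is LISTENing on, to confirm the agent really
# is bound to 8082 rather than the default 5051. In /proc/net/tcp, state
# 0A means LISTEN and the local port is the hex field after the colon.
listening_ports() {
  awk 'NR > 1 && $4 == "0A" { split($2, a, ":"); print a[2] }' /proc/net/tcp |
    while read -r hex; do printf '%d\n' "0x$hex"; done |
    sort -n | uniq
}

listening_ports

# The firewall rules Vinod asks about can be listed with (needs root):
#   iptables -L INPUT -n --line-numbers
```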


On Wed, Aug 31, 2016 at 6:18 PM, Pankaj Saha <ps...@binghamton.edu> wrote:

> I think the executor is trying to register by communicating with the
> Mesos master, and it fails due to a network restriction.
> How can I change the /tmp/ path? I have set /var/lib/mesos as my
> work_dir.
>
>
> *I am explaining my setup here:*
>  I have a Mesos setup where the master and agent both run on the same
> network on my university campus. The Mesos agent node sits behind a
> firewall where only ports 5000 to 6000 are open for incoming traffic,
> whereas the Mesos master has no such restrictions. I am running the master
> service on master:5050 and the agent on agent:5051 (the defaults).
>
> I can see the agent is communicating correctly with the master and offering
> the available resources. I have set the available ports for agents to
> ports:[5001-6000] in the *src/slave/constants.cpp* file so that frameworks
> can communicate only through the ports which are open for my agent system
> behind the firewall.
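
As an editorial aside: patching src/slave/constants.cpp should not be necessary, since the offered port range can be set at startup with the agent's `--resources` flag. A sketch, filled in with the addresses and paths quoted in this thread rather than a command taken from any of the mails:

```shell
# Start the agent with an explicit "ports" resource instead of recompiling.
# Master address, work_dir, and port range mirror the values in this thread.
mesos-slave --master=129.114.110.143:5050 \
            --port=5051 \
            --work_dir=/var/lib/mesos \
            --resources="ports:[5001-6000]"
```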
>
> Now when I launch jobs through the Mesosphere Marathon framework, I can
> see all jobs connect to the mesos-agent through the mentioned port
> range [5001-6000]. But my jobs are not getting submitted. So I started
> debugging and realised that when launching a job the Mesos agent creates
> and launches an executor (*src/executor/executor.cpp*) which communicates
> with the Mesos master through a random port, which is outside my available
> range of 5000-6000 open ports. Since my agent machine cannot accept
> requests on those ports, the executor times out and is restarted again and
> again after every 1-minute time limit.
>
> I could not find out where exactly that random port is assigned. Is there
> any socket setting we can change to make the executor connect on a
> desired range of ports? Please let me know if my understanding is
> correct and how I can change those ports for executor registration.
>
>
>
>
> On Wed, Aug 31, 2016 at 3:09 AM, haosdent <ha...@gmail.com> wrote:
>
> > > I0829 14:27:38.322805  2700 slave.cpp:4307] Terminating executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 because it did not register
> > > within 1mins
> >
> > This log looks weird. Could you find anything in the stdout/stderr of the
> > executor? For the executor 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > above, it should be under the folder '/tmp/mesos/slaves/d6f0e3e2-
> > d144-4275-9d38-82327408622b-S12/frameworks/c796100f-9ecb-
> > 46fa-90a2-72ad649c5dd3-0000/executors/test.1fb85a35-6e16-
> > 11e6-bec9-c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/'
> >
> > Apart from that, running Mesos under '/tmp' is not recommended.
> >
> > On Tue, Aug 30, 2016 at 2:32 AM, Pankaj Saha <ps...@binghamton.edu>
> > wrote:
> >
> > > here is the log:
> > >
> > >
> > >
> > > I0829 14:24:21.727960 2679 main.cpp:223] Build: 2016-08-28 13:39:46 by
> > > root
> > > I0829 14:24:21.728159  2679 main.cpp:225] Version: 0.28.2
> > > I0829 14:24:21.733256  2679 containerizer.cpp:149] Using isolation:
> > > posix/cpu,posix/mem,filesystem/posix
> > > I0829 14:24:21.738895  2679 linux_launcher.cpp:101] Using
> > > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> > > I0829 14:24:21.748019  2679 main.cpp:328] Starting Mesos slave
> > > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
> > > 128.226.116.69:8082
> > > I0829 14:24:21.750114  2679 slave.cpp:194] Flags at startup:
> > > --advertise_ip="128.226.116.69" --appc_simple_discovery_uri_
> > > prefix="http://"
> > > --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5"
> > > --cgroups_cpu_enable_pids_and_tids_count="false"
> > > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> > > --cgroups_limit_swap="false" --cgroups_root="mesos"
> > > --container_disk_watch_interval="15secs" --containerizers="mesos"
> > > --default_role="*" --disk_watch_interval="1mins" --docker="docker"
> > > --docker_kill_orphans="true" --docker_registry="https://
> > > registry-1.docker.io"
> > > --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock"
> > > --docker_stop_timeout="0ns" --docker_store_dir="/tmp/
> mesos/store/docker"
> > > --enforce_container_disk_quota="false"
> > > --executor_registration_timeout="1mins"
> > > --executor_shutdown_grace_period="5secs"
> > > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> > > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> > > --hadoop_home="" --help="false" --hostname_lookup="true"
> > > --image_provisioner_backend="copy" --initialize_driver_logging="true"
> > > --isolation="posix/cpu,posix/mem"
> > > --launcher_dir="/home/pankaj/mesos-0.28.2/build/src" --logbufsecs="0"
> > > --logging_level="INFO" --master="129.114.110.143:5050"
> > > --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> > > --perf_interval="1mins" --port="8082" --qos_correction_interval_min=
> > "0ns"
> > > --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> > > --registration_backoff_factor="1secs" --revocable_cpu_low_priority="
> > true"
> > > --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
> > > --switch_user="true" --systemd_enable_support="true"
> > > --systemd_runtime_directory="/run/systemd/system" --version="false"
> > > --work_dir="/tmp/mesos"
> > > I0829 14:24:21.753572  2679 slave.cpp:464] Slave resources: cpus(*):2;
> > > mem(*):2855; disk(*):84691; ports(*):[8081-8081]
> > > I0829 14:24:21.753706  2679 slave.cpp:472] Slave attributes: [  ]
> > > I0829 14:24:21.753762  2679 slave.cpp:477] Slave hostname:
> > > venom.cs.binghamton.edu
> > > I0829 14:24:21.770992 2696 state.cpp:58] Recovering state from
> > > '/tmp/mesos/meta'
> > > I0829 14:24:21.771304  2696 state.cpp:698] No checkpointed resources
> > found
> > > at '/tmp/mesos/meta/resources/resources.info'
> > > I0829 14:24:21.771644  2696 state.cpp:101] Failed to find the latest
> > slave
> > > from '/tmp/mesos/meta'
> > > I0829 14:24:21.772583  2696 status_update_manager.cpp:200] Recovering
> > > status update manager
> > > I0829 14:24:21.773082  2698 containerizer.cpp:407] Recovering
> > containerizer
> > > I0829 14:24:21.777489  2702 provisioner.cpp:245] Provisioner recovery
> > > complete
> > > I0829 14:24:21.778149  2699 slave.cpp:4550] Finished recovery
> > > I0829 14:24:21.779564  2699 slave.cpp:796] New master detected at
> > > master@129.114.110.143:5050
> > > I0829 14:24:21.779742 2697 status_update_manager.cpp:174] Pausing
> > sending
> > > status updates
> > > I0829 14:24:21.780607  2699 slave.cpp:821] No credentials provided.
> > > Attempting to register without authentication
> > > I0829 14:24:21.781394  2699 slave.cpp:832] Detecting new master
> > > I0829 14:24:22.698812  2702 slave.cpp:971] Registered with master
> > > master@129.114.110.143:5050; given slave ID
> > > d6f0e3e2-d144-4275-9d38-82327408622b-S12
> > > I0829 14:24:22.699113  2698 status_update_manager.cpp:181] Resuming
> > sending
> > > status updates
> > > I0829 14:24:22.700258  2702 slave.cpp:1030] Forwarding total
> > oversubscribed
> > > resources
> > > I0829 14:24:43.638958  2695 http.cpp:190] HTTP GET for /slave(1)/state
> > from
> > > 128.226.119.78:59261 with User-Agent='Mozilla/5.0 (X11; Linux x86_64)
> > > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86
> Safari/537.36'
> > > I0829 14:25:21.764268  2702 slave.cpp:4359] Current disk usage 9.67%.
> Max
> > > allowed age: 5.622868987169502days
> > > I0829 14:26:21.778849  2695 slave.cpp:4359] Current disk usage 9.67%.
> Max
> > > allowed age: 5.622860462326585days
> > > I0829 14:26:38.271085  2698 slave.cpp:1361] Got assigned task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:26:38.311063  2698 slave.cpp:1480] Launching task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:26:38.314755  2698 paths.cpp:528] Trying to chown
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > to user 'root'
> > > I0829 14:26:38.320300  2698 slave.cpp:5352] Launching executor
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources cpus(*):0.1;
> > > mem(*):32 in work directory
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > I0829 14:26:38.321523  2702 containerizer.cpp:666] Starting container
> > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' for executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > I0829 14:26:38.322588  2698 slave.cpp:1698] Queuing task
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' for executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:26:38.358906  2702 linux_launcher.cpp:304] Cloning child
> process
> > > with flags =
> > > I0829 14:26:38.366492  2702 containerizer.cpp:1179] Checkpointing
> > > executor's forked pid 2758 to
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/
> pids/forked.pid'
> > > I0829 14:27:21.779755  2701 slave.cpp:4359] Current disk usage 9.67%.
> Max
> > > allowed age: 5.622850415190289days
> > > I0829 14:27:38.322805  2700 slave.cpp:4307] Terminating executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 because it did not register
> > > within 1mins
> > > I0829 14:27:38.323226  2700 containerizer.cpp:1453] Destroying container
> > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > I0829 14:27:38.329186  2702 cgroups.cpp:2427] Freezing cgroup
> > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > I0829 14:27:38.331509  2699 cgroups.cpp:1409] Successfully froze cgroup
> > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> after
> > > 2.19392ms
> > > I0829 14:27:38.334520  2698 cgroups.cpp:2445] Thawing cgroup
> > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > I0829 14:27:38.337821  2698 cgroups.cpp:1438] Successfullly thawed
> cgroup
> > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> after
> > > 3.194112ms
> > > I0829 14:27:38.435214  2696 containerizer.cpp:1689] Executor for
> > container
> > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' has exited
> > > I0829 14:27:38.441556  2695 provisioner.cpp:306] Ignoring destroy
> request
> > > for unknown container dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > I0829 14:27:38.442186  2695 slave.cpp:3871] Executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 terminated with signal
> Killed
> > > I0829 14:27:38.445689  2695 slave.cpp:3012] Handling status update
> > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 from @0.0.0.0:0
> > > W0829 14:27:38.447599  2702 containerizer.cpp:1295] Ignoring update for
> > > unknown container: dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > I0829 14:27:38.448391  2702 status_update_manager.cpp:320] Received
> > status
> > > update TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for
> task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.449525  2702 status_update_manager.cpp:824]
> Checkpointing
> > > UPDATE for status update TASK_FAILED (UUID:
> > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.523027  2696 slave.cpp:3410] Forwarding the update
> > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 to
> master@129.114.110.143:5050
> > > I0829 14:27:38.627722  2698 status_update_manager.cpp:392] Received
> > status
> > > update acknowledgement (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d)
> for
> > > task test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.627943  2698 status_update_manager.cpp:824]
> Checkpointing
> > > ACK for status update TASK_FAILED (UUID:
> > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.698822  2701 slave.cpp:3975] Cleaning up executor
> > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.699582  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > for gc 6.99999190486222days in the future
> > > I0829 14:27:38.700202  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > for gc 6.99999190029037days in the future
> > > I0829 14:27:38.700382  2701 slave.cpp:4063] Cleaning up framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.700443  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > for gc 6.99999189796148days in the future
> > > I0829 14:27:38.700649  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > for gc 6.99999189622815days in the future
> > > I0829 14:27:38.700845  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > for gc 6.99999189143704days in the future
> > > I0829 14:27:38.701015  2701 status_update_manager.cpp:282] Closing
> status
> > > update streams for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:38.701161  2698 gc.cpp:55] Scheduling
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > for gc 6.9999918900237days in the future
> > > I0829 14:27:39.651463  2697 slave.cpp:1361] Got assigned task
> > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:39.655815  2696 gc.cpp:83] Unscheduling
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > from gc
> > > I0829 14:27:39.656445  2696 gc.cpp:83] Unscheduling
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > from gc
> > > I0829 14:27:39.656855  2702 slave.cpp:1480] Launching task
> > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:39.660585  2702 paths.cpp:528] Trying to chown
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > to user 'root'
> > > I0829 14:27:39.666008  2702 slave.cpp:5352] Launching executor
> > > test.445696e6-6e16-11e6-bec9-c27afc834a0c of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources cpus(*):0.1;
> > > mem(*):32 in work directory
> > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > I0829 14:27:39.667603  2702 slave.cpp:1698] Queuing task
> > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' for executor
> > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > I0829 14:27:39.668207  2702 containerizer.cpp:666] Starting container
> > > 'ea676570-0a2a-49c3-a75c-14e045eb842b' for executor
> > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > I0829 14:27:39.678665  2702 linux_launcher.cpp:304] Cloning child
> process
> > > with flags =
> > > I0829 14:27:39.681824  2702 containerizer.cpp:1179] Checkpointing
> > > executor's forked pid 2799 to
> > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > 72ad649c5dd3-0000/executors/test.445696e6-6e16-11e6-bec9-
> > > c27afc834a0c/runs/ea676570-0a2a-49c3-a75c-14e045eb842b/
> pids/forked.pid'
> > >
> > > Thanks
> > > Pankaj
> > >
> > >
> > > On Mon, Aug 29, 2016 at 7:25 AM, haosdent <ha...@gmail.com> wrote:
> > >
> > > > Hi @Pankaj, could you provide logs from when "the job is getting
> > > > restarted and a new container is created with a new process id"? The
> > > > logs you provided look normal.
> > > >
> > > > On Mon, Aug 29, 2016 at 5:26 AM, Pankaj Saha <ps...@binghamton.edu>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > I am facing an issue with jobs launched on my Mesos agents. I am
> > > > > trying to launch a job through the Marathon framework, and the job
> > > > > stays in the staged state and does not run.
> > > > > I can see the log message on the agent console below:
> > > > >
> > > > >  Scheduling
> > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > 9d38-82327408622b-S8/
> > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > for gc 6.99999884239407days in the future
> > > > > I0828 16:20:36.053483 28512 slave.cpp:1361] *Got assigned task
> > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0828 16:20:36.056224 28510 gc.cpp:83] Unscheduling
> > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0828 16:20:36.056715 28510 gc.cpp:83] Unscheduling
> > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > 9d38-82327408622b-S8/
> > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0828 16:20:36.057231 28509 slave.cpp:1480] *Launching task
> > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0828 16:20:36.058661 28509 paths.cpp:528]* Trying to chown*
> > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-13adb145c12d'
> > > > > to user 'root'
> > > > > I0828 16:20:36.067807 28509 slave.cpp:5352]* Launching executor
> > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with
> > resources
> > > > > cpus(*):0.1; mem(*):32 in work directory
> > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-13adb145c12d'
> > > > > I0828 16:20:36.069314 28509 slave.cpp:1698] *Queuing task
> > > > > 'test-crixus.*eb66a42b-6d5c-11e6-bec9-c27afc834a0c'
> > > > > for executor 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of
> > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0828 16:20:36.069902 28509 containerizer.cpp:666] *Starting
> > container*
> > > > > '99620406-87b5-406c-a88b-13adb145c12d' for executor
> > > > > 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of framework
> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > I0828 16:20:36.080713 28509 linux_launcher.cpp:304] *Cloning child
> > > > process*
> > > > > with flags =
> > > > > I0828 16:20:36.084738 28509 containerizer.cpp:1179] *Checkpointing
> > > > > executor's forked pid 29629* to
> > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > 9d38-82327408622b-S8/
> > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test-crixus.eb66a42b-6d5c-11e6-bec9-
> > > > c27afc834a0c/runs/99620406-
> > > > > 87b5-406c-a88b-13adb145c12d/pids/forked.pid'
> > > > >
> > > > >
> > > > > But after that, the job gets restarted and a new container is
> > > > > created with a new process id. This happens indefinitely, which
> > > > > keeps the job in the staged state on the mesos-master.
> > > > >
> > > > > This job is nothing but a simple echo "hello world" kind of shell
> > > > > command. Can anyone please point out where it is failing or what I
> > > > > am doing wrong?
> > > > >
> > > > >
> > > > >
> > > > > Thanks
> > > > > Pankaj
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > > >
> > >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>

Re: jobs are stuck in agents and staying in staged state

Posted by Erik Weathers <ew...@groupon.com.INVALID>.
You need to ensure there is no overlap between these 3 things:

1. Static ports for mesos agent/master, e.g. 5050, 5051 by default.
2. Linux ephemeral port range (32768-61000 by default).
3. Mesos view of ports as resources (31000-32000 by default).
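
Erik's three ranges can be sanity-checked mechanically; this small POSIX-shell sketch (an editorial addition, with his default values filled in) flags any pairwise overlap:

```shell
# Check three port ranges pairwise for overlap. Two inclusive ranges
# [LO1,HI1] and [LO2,HI2] intersect exactly when LO1 <= HI2 and LO2 <= HI1.
overlaps() {
  # overlaps LO1 HI1 LO2 HI2 -> exit 0 when the ranges intersect
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

check() {
  # check NAME1 NAME2 LO1 HI1 LO2 HI2 -> prints "NAME1 vs NAME2: ok|OVERLAP"
  if overlaps "$3" "$4" "$5" "$6"; then
    echo "$1 vs $2: OVERLAP"
  else
    echo "$1 vs $2: ok"
  fi
}

# Defaults from the list above; substitute your own values. The live
# ephemeral range is in /proc/sys/net/ipv4/ip_local_port_range.
check "static"    "ephemeral" 5050  5051  32768 61000
check "static"    "resources" 5050  5051  31000 32000
check "ephemeral" "resources" 32768 61000 31000 32000
```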

The behavior initially described here sounds to me like a scheduler may
have been offered an already used port on the agent host.

I answered a related question before:

http://unix.stackexchange.com/a/237543

- Erik

On Thursday, September 1, 2016, haosdent <ha...@gmail.com> wrote:

> If you are on Linux, you can run the following command on every Mesos agent
> to make the kernel assign ephemeral ports within 5000-6000:
>
> echo "5000 6000" > /proc/sys/net/ipv4/ip_local_port_range
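
One caveat worth noting: the echo above only lasts until reboot. On distributions with the standard /etc/sysctl.d mechanism, the narrowed range can be made persistent with a drop-in file (a sketch, assuming that mechanism is present):

```shell
# Persist the narrowed ephemeral range across reboots.
echo "net.ipv4.ip_local_port_range = 5000 6000" |
  sudo tee /etc/sysctl.d/90-mesos-ports.conf
sudo sysctl --system   # re-apply all sysctl configuration now
```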
> > cpus(*):0.1;
> > > > > mem(*):32 in work directory
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > I0829 14:26:38.321523  2702 containerizer.cpp:666] Starting
> container
> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' for executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > I0829 14:26:38.322588  2698 slave.cpp:1698] Queuing task
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:26:38.358906  2702 linux_launcher.cpp:304] Cloning child
> > > process
> > > > > with flags =
> > > > > I0829 14:26:38.366492  2702 containerizer.cpp:1179] Checkpointing
> > > > > executor's forked pid 2758 to
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/
> > > pids/forked.pid'
> > > > > I0829 14:27:21.779755  2701 slave.cpp:4359] Current disk usage
> 9.67%.
> > > Max
> > > > > allowed age: 5.622850415190289days
> > > > > I0829 14:27:38.322805  2700 slave.cpp:4307]
> > > > > *Terminating executor ''test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > of
> > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it
> did
> > > not
> > > > > register within 1minsI0829 14:27:38.323226 2700
> > containerizer.cpp:1453]
> > > > > Destroying container 'dff399f0-beb1-4c49-bd8e-c19621de2f71'*
> > > > > I0829 14:27:38.329186  2702 cgroups.cpp:2427] Freezing cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.331509  2699 cgroups.cpp:1409] Successfully froze
> > cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > after
> > > > > 2.19392ms
> > > > > I0829 14:27:38.334520  2698 cgroups.cpp:2445] Thawing cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.337821  2698 cgroups.cpp:1438] Successfullly thawed
> > > cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > after
> > > > > 3.194112ms
> > > > > I0829 14:27:38.435214  2696 containerizer.cpp:1689] Executor for
> > > > container
> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' has exited
> > > > > I0829 14:27:38.441556  2695 provisioner.cpp:306] Ignoring destroy
> > > request
> > > > > for unknown container dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.442186  2695 slave.cpp:3871] Executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 terminated with signal
> > > Killed
> > > > > I0829 14:27:38.445689  2695 slave.cpp:3012] Handling status update
> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 from @0.0.0.0:0
> > > > > W0829 14:27:38.447599  2702 containerizer.cpp:1295] Ignoring update
> > for
> > > > > unknown container: dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.448391  2702 status_update_manager.cpp:320] Received
> > > > status
> > > > > update TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d)
> for
> > > task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.449525  2702 status_update_manager.cpp:824]
> > > Checkpointing
> > > > > UPDATE for status update TASK_FAILED (UUID:
> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.523027  2696 slave.cpp:3410] Forwarding the update
> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 to
> > > master@129.114.110.143:5050
> > > > > I0829 14:27:38.627722  2698 status_update_manager.cpp:392] Received
> > > > status
> > > > > update acknowledgement (UUID: 154a67ea-f6d5-4e9c-ad3d-
> 9b09161ba34d)
> > > for
> > > > > task test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.627943  2698 status_update_manager.cpp:824]
> > > Checkpointing
> > > > > ACK for status update TASK_FAILED (UUID:
> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.698822  2701 slave.cpp:3975] Cleaning up executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.699582  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > for gc 6.99999190486222days in the future
> > > > > I0829 14:27:38.700202  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > > > for gc 6.99999190029037days in the future
> > > > > I0829 14:27:38.700382  2701 slave.cpp:4063] Cleaning up framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.700443  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > for gc 6.99999189796148days in the future
> > > > > I0829 14:27:38.700649  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > c27afc834a0c'
> > > > > for gc 6.99999189622815days in the future
> > > > > I0829 14:27:38.700845  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > for gc 6.99999189143704days in the future
> > > > > I0829 14:27:38.701015  2701 status_update_manager.cpp:282] Closing
> > > status
> > > > > update streams for framework c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000
> > > > > I0829 14:27:38.701161  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000'
> > > > > for gc 6.9999918900237days in the future
> > > > > I0829 14:27:39.651463  2697 slave.cpp:1361] Got assigned task
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.655815  2696 gc.cpp:83] Unscheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0829 14:27:39.656445  2696 gc.cpp:83] Unscheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0829 14:27:39.656855  2702 slave.cpp:1480] Launching task
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.660585  2702 paths.cpp:528] Trying to chown
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > > to user 'root'
> > > > > I0829 14:27:39.666008  2702 slave.cpp:5352] Launching executor
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
> > cpus(*):0.1;
> > > > > mem(*):32 in work directory
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > > I0829 14:27:39.667603  2702 slave.cpp:1698] Queuing task
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.668207  2702 containerizer.cpp:666] Starting
> container
> > > > > 'ea676570-0a2a-49c3-a75c-14e045eb842b' for executor
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > I0829 14:27:39.678665  2702 linux_launcher.cpp:304] Cloning child
> > > process
> > > > > with flags =
> > > > > I0829 14:27:39.681824  2702 containerizer.cpp:1179] Checkpointing
> > > > > executor's forked pid 2799 to
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.445696e6-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/ea676570-0a2a-49c3-a75c-14e045eb842b/
> > > pids/forked.pid'
> > > > >
> > > > > Thanks
> > > > > Pankaj
> > > > >
> > > > >
> > > > > On Mon, Aug 29, 2016 at 7:25 AM, haosdent <haosdent@gmail.com
> <javascript:;>>
> > wrote:
> > > > >
> > > > > > Hi, @Pankaj, Could you provide logs during " the job is getting
> > > > restarted
> > > > > > and a new container is created with a new process id. ". The logs
> > you
> > > > > > provided looks normal.
> > > > > >
> > > > > > On Mon, Aug 29, 2016 at 5:26 AM, Pankaj Saha <
> > psaha4@binghamton.edu <javascript:;>>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > > I am facing an issue with a launched jobs into my mesos
> agents. I
> > > am
> > > > > > trying
> > > > > > > to launch a job through marathon framework and job is staying
> in
> > > > > stagged
> > > > > > > state and not running.
> > > > > > > I could see the log message at the agent console as below:
> > > > > > >
> > > > > > >  Scheduling
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > for gc 6.99999884239407days in the future
> > > > > > > I0828 16:20:36.053483 28512 slave.cpp:1361] *Got assigned task
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.056224 28510 gc.cpp:83] Unscheduling
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000'
> > > > > > > from gc
> > > > > > > I0828 16:20:36.056715 28510 gc.cpp:83] Unscheduling
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > from gc
> > > > > > > I0828 16:20:36.057231 28509 slave.cpp:1480] *Launching task
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.058661 28509 paths.cpp:528]* Trying to chown*
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
> > 13adb145c12d'
> > > > > > > to user 'root'
> > > > > > > I0828 16:20:36.067807 28509 slave.cpp:5352]* Launching executor
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with
> > > > resources
> > > > > > > cpus(*):0.1; mem(*):32 in work directory
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
> > 13adb145c12d'
> > > > > > > I0828 16:20:36.069314 28509 slave.cpp:1698] *Queuing task
> > > > > > > 'test-crixus.*eb66a42b-6d5c-11e6-bec9-c27afc834a0c'
> > > > > > > for executor 'test-crixus.eb66a42b-6d5c-
> 11e6-bec9-c27afc834a0c'
> > of
> > > > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.069902 28509 containerizer.cpp:666] *Starting
> > > > container*
> > > > > > > '99620406-87b5-406c-a88b-13adb145c12d' for executor
> > > > > > > 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of
> framework
> > > > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > I0828 16:20:36.080713 28509 linux_launcher.cpp:304] *Cloning
> > child
> > > > > > process*
> > > > > > > with flags =
> > > > > > > I0828 16:20:36.084738 28509 containerizer.cpp:1179]
> > *Checkpointing
> > > > > > > executor's forked pid 29629* to
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > > > executors/test-crixus.eb66a42b-6d5c-11e6-bec9-
> > > > > > c27afc834a0c/runs/99620406-
> > > > > > > 87b5-406c-a88b-13adb145c12d/pids/forked.pid'
> > > > > > >
> > > > > > >
> > > > > > > But after that, the job is getting restarted and a new
> container
> > is
> > > > > > created
> > > > > > > with a new process id. It happening infinitely which is keeping
> > the
> > > > job
> > > > > > in
> > > > > > > stagged state to mesos-master.
> > > > > > >
> > > > > > > This job is nothing but a simle echo "hello world" kind of
> shell
> > > > > command.
> > > > > > > Can anyone please point out where its failing or I am doing
> > wrong.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > > Pankaj
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Haosdent Huang
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > > >
> > >
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>

announcing canary-releasing and autoscaling solution Vamp moving to beta

Posted by "olaf@magnetic.io" <ol...@magnetic.io>.
Hello everybody,

here's a quick ping to announce that we're moving our open-source canary-test/release and workflow-driven autoscaling solution Vamp to 0.9.0, graduating from alpha to beta.

We’d be very interested in any collaboration on improving Vamp and hearing your feedback on what we can improve or add.

https://github.com/magneticio/vamp/releases/tag/0.9.0 <https://github.com/magneticio/vamp/releases/tag/0.9.0>

Thanks! Olaf

Olaf Molenveld
co-founder / CEO
-------------------------------------
VAMP: the Canary test and release platform for containers by magnetic.io
E: olaf@magnetic.io
T: +31653362783
Skype: olafmol
www.vamp.io <http://www.vamp.io/>
www.magnetic.io <http://www.magnetic.io/>

Re: jobs are stuck in agents and staying in stagged state

Posted by Pankaj Saha <ps...@binghamton.edu>.
While digging into the code to trace the flow, I noticed something that I
believe is the root of the issue with my setup. Because my Mesos agents
have private IPs and use the firewall's public address as --advertise_ip,
the executor (ExecutorInfo) receives the firewall address as its host,
whereas the executor's host should be localhost.
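A quick way to confirm what the executor actually receives (a hypothetical check, not something from this thread) is to launch a throwaway Marathon app whose command simply dumps its environment; the command's stdout is captured in the task sandbox, so the HOST value the executor saw can be read back from there:

```shell
# Throwaway Marathon app command: dump the executor-provided environment.
# Its stdout lands in the task sandbox, so the HOST/PORT values the
# executor actually received are visible in the sandbox's stdout file.
env | sort
```

Viewing that sandbox stdout in the agent UI then shows directly whether HOST carries the firewall's public address or the agent's private one.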

My executor command's name-value pairs look like this:

command {
  environment {
    variables {
      name: "MARATHON_APP_VERSION"
      value: "2016-09-09T06:15:47.986Z"
    }
*    variables {*
*      name: "HOST"*
*      value: "csfirewall.binghamton.edu"*
*    }*
    variables {
      name: "MARATHON_APP_RESOURCE_CPUS"
      value: "1.0"
    }
    variables {
      name: "PORT_10000"
      value: "8083"
    }
    variables {
      name: "MESOS_TASK_ID"
      value: "test-job.34bfa7a5-7760-11e6-aaea-0cc47a49cf86"
    }
    variables {
      name: "PORT"
      value: "8083"
    }
    variables {
      name: "MARATHON_APP_RESOURCE_MEM"
      value: "128.0"
    }
    variables {
      name: "PORTS"
      value: "8083"
    }
    variables {
      name: "MARATHON_APP_RESOURCE_DISK"
      value: "128.0"
    }
    variables {
      name: "MARATHON_APP_LABELS"
      value: ""
    }
    variables {
      name: "MARATHON_APP_ID"
      value: "/test-job"
    }
    variables {
      name: "PORT0"
      value: "8083"
    }
  }
  value: "/home/pankaj/mesos-0.28.2/build/src/mesos-executor"
  shell: true
}

ExecutorInfo is a Java class used via JNI. Can anyone please help me
figure out how this executor info gets built? If I can change the
hostname value before the executor runs, I believe that will resolve my
job execution issue.
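For what it's worth, instead of patching ExecutorInfo through JNI, one possible workaround is to pin the libprocess endpoint before starting the agent. This is only a sketch: LIBPROCESS_IP and LIBPROCESS_PORT are standard libprocess overrides, but whether they produce the desired executor binding on this particular NAT setup is an assumption, and the addresses below are just the examples used earlier in this thread.

```shell
# Sketch, not verified on this setup: pin the libprocess endpoint to the
# agent's private IP and a firewall-open port, so registration traffic
# flows through the NAT-forwarded 8082-8090 window rather than a random
# ephemeral port. Both values here are illustrative examples.
export LIBPROCESS_IP=10.11.12.13    # agent's private address (example)
export LIBPROCESS_PORT=8085         # inside the forwarded 8082-8090 range
```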


Thanks
Pankaj



On Thu, Sep 1, 2016 at 2:48 PM, Pankaj Saha <ps...@binghamton.edu> wrote:

> I am using my local system as a slave which is behind campus
> firewall. Firewall allows me to access only 8082 to 8090 ports for all
> incoming requests. My slave system is having a private ip address. So I am
> using a NAT conversion at the firewall which will convert all incoming
> requests to ports 8080-8090 ports on the firewall to my private ip machine.
>
> Firewall has a public ip: 128.x.y.z  and  mesos slave has a private ip:
> 10.11.12.13
>
> Here is the NAT rule:   128.x.y.z port 8082-8090 will change into
> 10.11.12.13 port 8082-8090
> I am using firewalls address as  *--advertise_ip=* *128.x.y.z *address,
> which is registering  *128.x.y.z:8082* as a slave. Now jobs are coming to
> my private ip machine through that NAT conversion but containers are
> failing and relaunching again and again.
>
> Mesos Master is running on 5050
> Mesos slave is running on 8082
> Mesos ports as resources are (8083-8084).
> Linux ephemeral port range is changed in slave machine to (8085-8090).
>
> I hope this explains my setup. I am planning to make changes if required
> in the source code to make this kind of set up working.
>
>
>
>
>
>
> On Thu, Sep 1, 2016 at 11:04 AM, haosdent <ha...@gmail.com> wrote:
>
>> If you use Linux, you could execute follow command on every Mesos Agent to
>> make the ephemeral port assigned during 5000~6000
>>
>> echo "5000 6000" > /proc/sys/net/ipv4/ip_local_port_range
>>
>> On Thu, Sep 1, 2016 at 3:12 PM, Vinod Kone <vi...@apache.org> wrote:
>>
>> > AFAICT, your agent is listening on port 8082 and not the default 5051.
>> > ---------
>> > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
>> > 128.226.116.69:8082
>> > ------------
>> >
>> > The fact that agent is receiving a task from the master means that the
>> > firewall on the agent allows incoming connections to 8082. So I'm
>> surprised
>> > that a local connection from the executor to the agent is being denied.
>> > What exactly are your firewall rules on the agent?
>> >
>> > Also, can you share the stderr/stdout of an example executor?
>> >
>> >
>> > On Wed, Aug 31, 2016 at 6:18 PM, Pankaj Saha <ps...@binghamton.edu>
>> > wrote:
>> >
>> > > I think the executor wants to get registered by communicating with
>> > > mesos master and it fails due to network restriction.
>> > > How can I change the /tmp/ path? I have mentioned /var/lib/mesos  as
>> my
>> > > work_dir.
>> > >
>> > >
>> > > *I am explaining my setup here:*
>> > >  I have a Mesos setup where master and slave both are running on the
>> same
>> > > network of my university campus. Mesos agent node is situated under a
>> > > firewall and only port: 5000 to port:6000 are open for incoming
>> traffic
>> > > whereas Mesos master has no such restrictions. I am running master
>> > service
>> > > on master:5050 and agent is running on agent:5051 as default.
>> > >
>> > > I can see agent is communicating correctly to master and offering the
>> > > available resources. I have mentioned the available ports for agents
>> are
>> > > ports:[5001-6000] in *src/slave/constants.cpp* file so that framework
>> can
>> > > communicate only through those ports which are open for my agent
>> system
>> > > behind the firewall.
>> > >
>> > > Now when I am launching jobs through Mesosphere marathon framework, I
>> can
>> > > see all jobs are connected to mesos-agent through those mentioned port
>> > > ranges[5001-6000]. But my jobs are not getting submitted. So I started
>> > > debugging and realised that when launching jobs mesos slaves create
>> and
>> > > launch an executor (*/erc/executor/executor.cpp*) which communicates
>> to
>> > the
>> > > mesos master through a random port. Which is outside my available
>> range
>> > of
>> > > 5000-6000 open ports. Now as through those ports my agent machine can
>> not
>> > > take any requests so executor is getting timed out and restarting the
>> > > executor again and again after every 1 min of time limit.
>> > >
>> > > I could not find out where exactly that random port is assigned. Is
>> there
>> > > any socket connection that we can change to get executor connection
>> > happen
>> > > on desired range of ports? Please let me know if my understanding is
>> > > correct and how can I change those ports for executor registration.
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Aug 31, 2016 at 3:09 AM, haosdent <ha...@gmail.com> wrote:
>> > >
>> > > > >I0829 14:27:38.322805  2700 slave.cpp:4307] *Terminating executor
>> > > > ''test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it did not
>> register
>> > > > within 1mins
>> > > >
>> > > > This log looks wired. Could you find anything in the stdout/stderr
>> of
>> > the
>> > > > executor. For the executor 'test.1fb85a35-6e16-11e6-bec9-
>> c27afc834a0c'
>> > > > above, it should be under the folder '/tmp/mesos/slaves/d6f0e3e2-
>> > > > d144-4275-9d38-82327408622b-S12/frameworks/c796100f-9ecb-
>> > > > 46fa-90a2-72ad649c5dd3-0000/executors/test.1fb85a35-6e16-
>> > > > 11e6-bec9-c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/'
>> > > >
>> > > > Apart from that, run mesos under '/tmp' is not recommended.
>> > > >
>> > > > On Tue, Aug 30, 2016 at 2:32 AM, Pankaj Saha <psaha4@binghamton.edu
>> >
>> > > > wrote:
>> > > >
>> > > > > here is the log:
>> > > > >
>> > > > >
>> > > > >
>> > > > > I0829 14:24:21.727960 2679 main.cpp:223] Build: 2016-08-28
>> 13:39:46
>> > by
>> > > > > root
>> > > > > I0829 14:24:21.728159  2679 main.cpp:225] Version: 0.28.2
>> > > > > I0829 14:24:21.733256  2679 containerizer.cpp:149] Using
>> isolation:
>> > > > > posix/cpu,posix/mem,filesystem/posix
>> > > > > I0829 14:24:21.738895  2679 linux_launcher.cpp:101] Using
>> > > > > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux
>> > launcher
>> > > > > I0829 14:24:21.748019  2679 main.cpp:328] Starting Mesos slave
>> > > > > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
>> > > > > 128.226.116.69:8082
>> > > > > I0829 14:24:21.750114  2679 slave.cpp:194] Flags at startup:
>> > > > > --advertise_ip="128.226.116.69" --appc_simple_discovery_uri_
>> > > > > prefix="http://"
>> > > > > --appc_store_dir="/tmp/mesos/store/appc"
>> --authenticatee="crammd5"
>> > > > > --cgroups_cpu_enable_pids_and_tids_count="false"
>> > > > > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> > > > > --cgroups_limit_swap="false" --cgroups_root="mesos"
>> > > > > --container_disk_watch_interval="15secs" --containerizers="mesos"
>> > > > > --default_role="*" --disk_watch_interval="1mins" --docker="docker"
>> > > > > --docker_kill_orphans="true" --docker_registry="https://
>> > > > > registry-1.docker.io"
>> > > > > --docker_remove_delay="6hrs" --docker_socket="/var/run/dock
>> er.sock"
>> > > > > --docker_stop_timeout="0ns" --docker_store_dir="/tmp/
>> > > mesos/store/docker"
>> > > > > --enforce_container_disk_quota="false"
>> > > > > --executor_registration_timeout="1mins"
>> > > > > --executor_shutdown_grace_period="5secs"
>> > > > > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>> > > > > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> > > > > --hadoop_home="" --help="false" --hostname_lookup="true"
>> > > > > --image_provisioner_backend="copy" --initialize_driver_logging="
>> > true"
>> > > > > --isolation="posix/cpu,posix/mem"
>> > > > > --launcher_dir="/home/pankaj/mesos-0.28.2/build/src"
>> > --logbufsecs="0"
>> > > > > --logging_level="INFO" --master="129.114.110.143:5050"
>> > > > > --oversubscribed_resources_interval="15secs"
>> > --perf_duration="10secs"
>> > > > > --perf_interval="1mins" --port="8082"
>> --qos_correction_interval_min=
>> > > > "0ns"
>> > > > > --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>> > > > > --registration_backoff_factor="1secs"
>> --revocable_cpu_low_priority="
>> > > > true"
>> > > > > --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
>> > > > > --switch_user="true" --systemd_enable_support="true"
>> > > > > --systemd_runtime_directory="/run/systemd/system"
>> --version="false"
>> > > > > --work_dir="/tmp/mesos"
>> > > > > I0829 14:24:21.753572  2679 slave.cpp:464] Slave resources:
>> > cpus(*):2;
>> > > > > mem(*):2855; disk(*):84691; ports(*):[8081-8081]
>> > > > > I0829 14:24:21.753706  2679 slave.cpp:472] Slave attributes: [  ]
>> > > > > I0829 14:24:21.753762  2679 slave.cpp:477] Slave hostname:
>> > > > > venom.cs.binghamton.edu
>> > > > > I0829 14:24:21.770992 2696 state.cpp:58] Recovering state from
>> > > > > '/tmp/mesos/meta'
>> > > > > I0829 14:24:21.771304  2696 state.cpp:698] No checkpointed
>> resources
>> > > > found
>> > > > > at '/tmp/mesos/meta/resources/resources.info'
>> > > > > I0829 14:24:21.771644  2696 state.cpp:101] Failed to find the
>> latest
>> > > > slave
>> > > > > from '/tmp/mesos/meta'
>> > > > > I0829 14:24:21.772583  2696 status_update_manager.cpp:200]
>> Recovering
>> > > > > status update manager
>> > > > > I0829 14:24:21.773082  2698 containerizer.cpp:407] Recovering
>> > > > containerizer
>> > > > > I0829 14:24:21.777489  2702 provisioner.cpp:245] Provisioner
>> recovery
>> > > > > complete
>> > > > > I0829 14:24:21.778149  2699 slave.cpp:4550] Finished recovery
>> > > > > I0829 14:24:21.779564  2699 slave.cpp:796] New master detected at
>> > > > > master@129.114.110.143:5050
>> > > > > I0829 14:24:21.779742 2697 status_update_manager.cpp:174] Pausing
>> > > > sending
>> > > > > status updates
>> > > > > I0829 14:24:21.780607  2699 slave.cpp:821] No credentials
>> provided.
>> > > > > Attempting to register without authentication
>> > > > > I0829 14:24:21.781394  2699 slave.cpp:832] Detecting new master
>> > > > > I0829 14:24:22.698812  2702 slave.cpp:971] Registered with master
>> > > > > master@129.114.110.143:5050; given slave ID
>> > > > > d6f0e3e2-d144-4275-9d38-82327408622b-S12
>> > > > > I0829 14:24:22.699113  2698 status_update_manager.cpp:181]
>> Resuming
>> > > > sending
>> > > > > status updates
>> > > > > I0829 14:24:22.700258  2702 slave.cpp:1030] Forwarding total
>> > > > oversubscribed
>> > > > > resources
>> > > > > I0829 14:24:43.638958  2695 http.cpp:190] HTTP GET for
>> > /slave(1)/state
>> > > > from
>> > > > > 128.226.119.78:59261 with User-Agent='Mozilla/5.0 (X11; Linux
>> > x86_64)
>> > > > > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86
>> > > Safari/537.36'
>> > > > > I0829 14:25:21.764268  2702 slave.cpp:4359] Current disk usage
>> 9.67%.
>> > > Max
>> > > > > allowed age: 5.622868987169502days
>> > > > > I0829 14:26:21.778849  2695 slave.cpp:4359] Current disk usage
>> 9.67%.
>> > > Max
>> > > > > allowed age: 5.622860462326585days
>> > > > > I0829 14:26:38.271085  2698 slave.cpp:1361] Got assigned task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:26:38.311063  2698 slave.cpp:1480] Launching task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:26:38.314755  2698 paths.cpp:528] Trying to chown
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
>> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
>> > > > > to user 'root'
>> > > > > I0829 14:26:38.320300  2698 slave.cpp:5352] Launching executor
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
>> > cpus(*):0.1;
>> > > > > mem(*):32 in work directory
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
>> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
>> > > > > I0829 14:26:38.321523  2702 containerizer.cpp:666] Starting
>> container
>> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' for executor
>> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > I0829 14:26:38.322588  2698 slave.cpp:1698] Queuing task
>> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' for executor
>> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:26:38.358906  2702 linux_launcher.cpp:304] Cloning child
>> > > process
>> > > > > with flags =
>> > > > > I0829 14:26:38.366492  2702 containerizer.cpp:1179] Checkpointing
>> > > > > executor's forked pid 2758 to
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
>> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/
>> > > pids/forked.pid'
>> > > > > I0829 14:27:21.779755  2701 slave.cpp:4359] Current disk usage
>> 9.67%.
>> > > Max
>> > > > > allowed age: 5.622850415190289days
>> > > > > I0829 14:27:38.322805  2700 slave.cpp:4307]
>> > > > > *Terminating executor ''test.1fb85a35-6e16-11e6-bec9
>> -c27afc834a0c'
>> > of
>> > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it
>> did
>> > > not
>> > > > > register within 1minsI0829 14:27:38.323226 2700
>> > containerizer.cpp:1453]
>> > > > > Destroying container 'dff399f0-beb1-4c49-bd8e-c19621de2f71'*
>> > > > > I0829 14:27:38.329186  2702 cgroups.cpp:2427] Freezing cgroup
>> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > > > I0829 14:27:38.331509  2699 cgroups.cpp:1409] Successfully froze
>> > cgroup
>> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > after
>> > > > > 2.19392ms
>> > > > > I0829 14:27:38.334520  2698 cgroups.cpp:2445] Thawing cgroup
>> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > > > I0829 14:27:38.337821  2698 cgroups.cpp:1438] Successfullly thawed
>> > > cgroup
>> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > after
>> > > > > 3.194112ms
>> > > > > I0829 14:27:38.435214  2696 containerizer.cpp:1689] Executor for
>> > > > container
>> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' has exited
>> > > > > I0829 14:27:38.441556  2695 provisioner.cpp:306] Ignoring destroy
>> > > request
>> > > > > for unknown container dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > > > I0829 14:27:38.442186  2695 slave.cpp:3871] Executor
>> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 terminated with signal
>> > > Killed
>> > > > > I0829 14:27:38.445689  2695 slave.cpp:3012] Handling status update
>> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 from @0.0.0.0:0
>> > > > > W0829 14:27:38.447599  2702 containerizer.cpp:1295] Ignoring
>> update
>> > for
>> > > > > unknown container: dff399f0-beb1-4c49-bd8e-c19621de2f71
>> > > > > I0829 14:27:38.448391  2702 status_update_manager.cpp:320]
>> Received
>> > > > status
>> > > > > update TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d)
>> for
>> > > task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.449525  2702 status_update_manager.cpp:824]
>> > > Checkpointing
>> > > > > UPDATE for status update TASK_FAILED (UUID:
>> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.523027  2696 slave.cpp:3410] Forwarding the update
>> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 to
>> > > master@129.114.110.143:5050
>> > > > > I0829 14:27:38.627722  2698 status_update_manager.cpp:392]
>> Received
>> > > > status
>> > > > > update acknowledgement (UUID: 154a67ea-f6d5-4e9c-ad3d-9b0916
>> 1ba34d)
>> > > for
>> > > > > task test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.627943  2698 status_update_manager.cpp:824]
>> > > Checkpointing
>> > > > > ACK for status update TASK_FAILED (UUID:
>> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
>> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.698822  2701 slave.cpp:3975] Cleaning up executor
>> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.699582  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
>> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
>> > > > > for gc 6.99999190486222days in the future
>> > > > > I0829 14:27:38.700202  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
>> > > > > for gc 6.99999190029037days in the future
>> > > > > I0829 14:27:38.700382  2701 slave.cpp:4063] Cleaning up framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:38.700443  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
>> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71'
>> > > > > for gc 6.99999189796148days in the future
>> > > > > I0829 14:27:38.700649  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
>> > c27afc834a0c'
>> > > > > for gc 6.99999189622815days in the future
>> > > > > I0829 14:27:38.700845  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > for gc 6.99999189143704days in the future
>> > > > > I0829 14:27:38.701015  2701 status_update_manager.cpp:282] Closing
>> > > status
>> > > > > update streams for framework c796100f-9ecb-46fa-90a2-
>> > 72ad649c5dd3-0000
>> > > > > I0829 14:27:38.701161  2698 gc.cpp:55] Scheduling
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > 72ad649c5dd3-0000'
>> > > > > for gc 6.9999918900237days in the future
>> > > > > I0829 14:27:39.651463  2697 slave.cpp:1361] Got assigned task
>> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:39.655815  2696 gc.cpp:83] Unscheduling
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > from gc
>> > > > > I0829 14:27:39.656445  2696 gc.cpp:83] Unscheduling
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > 72ad649c5dd3-0000'
>> > > > > from gc
>> > > > > I0829 14:27:39.656855  2702 slave.cpp:1480] Launching task
>> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:39.660585  2702 paths.cpp:528] Trying to chown
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
>> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
>> > > > > to user 'root'
>> > > > > I0829 14:27:39.666008  2702 slave.cpp:5352] Launching executor
>> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
>> > cpus(*):0.1;
>> > > > > mem(*):32 in work directory
>> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
>> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
>> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
>> > > > > I0829 14:27:39.667603  2702 slave.cpp:1698] Queuing task
>> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' for executor
>> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > I0829 14:27:39.668207  2702 containerizer.cpp:666] Starting
>> container
>> > > > > 'ea676570-0a2a-49c3-a75c-14e045eb842b' for executor
>> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
>> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > I0829 14:27:39.678665  2702 linux_launcher.cpp:304] Cloning child
>> > > process
>> > > > > with flags =
>> > > > > I0829 14:27:39.681824  2702 containerizer.cpp:1179] Checkpointing
>> > > > > executor's forked pid 2799 to
>> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > 72ad649c5dd3-0000/executors/test.445696e6-6e16-11e6-bec9-
>> > > > > c27afc834a0c/runs/ea676570-0a2a-49c3-a75c-14e045eb842b/
>> > > pids/forked.pid'
>> > > > >
>> > > > > Thanks
>> > > > > Pankaj
>> > > > >
>> > > > >
>> > > > > On Mon, Aug 29, 2016 at 7:25 AM, haosdent <ha...@gmail.com>
>> > wrote:
>> > > > >
>> > > > > > Hi, @Pankaj, Could you provide logs during " the job is getting
>> > > > restarted
>> > > > > > and a new container is created with a new process id. ". The
>> logs
>> > you
>> > > > > > provided looks normal.
>> > > > > >
>> > > > > > On Mon, Aug 29, 2016 at 5:26 AM, Pankaj Saha <
>> > psaha4@binghamton.edu>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi
>> > > > > > > I am facing an issue with a launched jobs into my mesos
>> agents. I
>> > > am
>> > > > > > trying
>> > > > > > > to launch a job through marathon framework and job is staying
>> in
>> > > > > stagged
>> > > > > > > state and not running.
>> > > > > > > I could see the log message at the agent console as below:
>> > > > > > >
>> > > > > > >  Scheduling
>> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
>> > > > > > 9d38-82327408622b-S8/
>> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > > > for gc 6.99999884239407days in the future
>> > > > > > > I0828 16:20:36.053483 28512 slave.cpp:1361] *Got assigned task
>> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
>> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > > > I0828 16:20:36.056224 28510 gc.cpp:83] Unscheduling
>> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > 72ad649c5dd3-0000'
>> > > > > > > from gc
>> > > > > > > I0828 16:20:36.056715 28510 gc.cpp:83] Unscheduling
>> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
>> > > > > > 9d38-82327408622b-S8/
>> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > > > from gc
>> > > > > > > I0828 16:20:36.057231 28509 slave.cpp:1480] *Launching task
>> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
>> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > > > I0828 16:20:36.058661 28509 paths.cpp:528]* Trying to chown*
>> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
>> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
>> > 13adb145c12d'
>> > > > > > > to user 'root'
>> > > > > > > I0828 16:20:36.067807 28509 slave.cpp:5352]* Launching
>> executor
>> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
>> > > > > > > of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with
>> > > > resources
>> > > > > > > cpus(*):0.1; mem(*):32 in work directory
>> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
>> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
>> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
>> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
>> > 13adb145c12d'
>> > > > > > > I0828 16:20:36.069314 28509 slave.cpp:1698] *Queuing task
>> > > > > > > 'test-crixus.*eb66a42b-6d5c-11e6-bec9-c27afc834a0c'
>> > > > > > > for executor 'test-crixus.eb66a42b-6d5c-11e
>> 6-bec9-c27afc834a0c'
>> > of
>> > > > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
>> > > > > > > I0828 16:20:36.069902 28509 containerizer.cpp:666] *Starting
>> > > > container*
>> > > > > > > '99620406-87b5-406c-a88b-13adb145c12d' for executor
>> > > > > > > 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of
>> framework
>> > > > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
>> > > > > > > I0828 16:20:36.080713 28509 linux_launcher.cpp:304] *Cloning
>> > child
>> > > > > > process*
>> > > > > > > with flags =
>> > > > > > > I0828 16:20:36.084738 28509 containerizer.cpp:1179]
>> > *Checkpointing
>> > > > > > > executor's forked pid 29629* to
>> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
>> > > > > > 9d38-82327408622b-S8/
>> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
>> > > > > > > executors/test-crixus.eb66a42b-6d5c-11e6-bec9-
>> > > > > > c27afc834a0c/runs/99620406-
>> > > > > > > 87b5-406c-a88b-13adb145c12d/pids/forked.pid'
>> > > > > > >
>> > > > > > >
>> > > > > > > But after that, the job is getting restarted and a new
>> container
>> > is
>> > > > > > created
>> > > > > > > with a new process id. It happening infinitely which is
>> keeping
>> > the
>> > > > job
>> > > > > > in
>> > > > > > > stagged state to mesos-master.
>> > > > > > >
>> > > > > > > This job is nothing but a simle echo "hello world" kind of
>> shell
>> > > > > command.
>> > > > > > > Can anyone please point out where its failing or I am doing
>> > wrong.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > Thanks
>> > > > > > > Pankaj
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Best Regards,
>> > > > > > Haosdent Huang
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Best Regards,
>> > > > Haosdent Huang
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>

Re: jobs are stuck in agents and staying in stagged state

Posted by Pankaj Saha <ps...@binghamton.edu>.
I am using my local system as a slave, which sits behind the campus
firewall. The firewall allows incoming requests only on ports 8082 to 8090.
My slave system has a private IP address, so I am using NAT at the
firewall, which forwards all incoming requests on ports 8082-8090 on the
firewall to my private-IP machine.

The firewall has a public IP (128.x.y.z) and the mesos slave has a private
IP (10.11.12.13).

Here is the NAT rule: 128.x.y.z ports 8082-8090 are translated into
10.11.12.13 ports 8082-8090.
I am using the firewall's address as *--advertise_ip=* *128.x.y.z*, which
registers *128.x.y.z:8082* as the slave. Jobs now reach my private-IP
machine through that NAT translation, but the containers keep failing and
relaunching again and again.

The Mesos master is running on port 5050.
The Mesos slave is running on port 8082.
The Mesos ports offered as resources are 8083-8084.
The Linux ephemeral port range on the slave machine is changed to 8085-8090.

I hope this explains my setup. I am willing to make changes to the source
code, if required, to get this kind of setup working.
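
For what it's worth, the setup described above could be sketched roughly as
follows. This is only an illustration: the iptables-based NAT, the interface,
and the 128.x.y.z / 10.11.12.13 addresses are placeholders taken from the
mail, not confirmed details of the actual campus firewall.

```shell
# Sketch of the described setup (assumptions: an iptables-based NAT box;
# 128.x.y.z and 10.11.12.13 are the placeholder addresses from this mail).

# On the firewall: DNAT incoming TCP 8082-8090 on the public IP to the agent.
iptables -t nat -A PREROUTING -d 128.x.y.z -p tcp --dport 8082:8090 \
    -j DNAT --to-destination 10.11.12.13

# On the agent host: restrict the kernel's ephemeral (source) port range so
# locally initiated connections also stay inside the open window.
echo "8085 8090" > /proc/sys/net/ipv4/ip_local_port_range

# Start the agent on an open port, advertising the firewall's public IP.
mesos-slave --master=129.114.110.143:5050 \
    --port=8082 \
    --advertise_ip=128.x.y.z \
    --resources="ports:[8083-8084]" \
    --work_dir=/var/lib/mesos
```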






On Thu, Sep 1, 2016 at 11:04 AM, haosdent <ha...@gmail.com> wrote:

> If you use Linux, you could execute follow command on every Mesos Agent to
> make the ephemeral port assigned during 5000~6000
>
> echo "5000 6000" > /proc/sys/net/ipv4/ip_local_port_range
>
> On Thu, Sep 1, 2016 at 3:12 PM, Vinod Kone <vi...@apache.org> wrote:
>
> > AFAICT, your agent is listening on port 8082 and not the default 5051.
> > ---------
> > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
> > 128.226.116.69:8082
> > ------------
> >
> > The fact that agent is receiving a task from the master means that the
> > firewall on the agent allows incoming connections to 8082. So I'm
> surprised
> > that a local connection from the executor to the agent is being denied.
> > What exactly are your firewall rules on the agent?
> >
> > Also, can you share the stderr/stdout of an example executor?
> >
> >
> > On Wed, Aug 31, 2016 at 6:18 PM, Pankaj Saha <ps...@binghamton.edu>
> > wrote:
> >
> > > I think the executor wants to get registered by communicating with
> > > mesos master and it fails due to network restriction.
> > > How can I change the /tmp/ path? I have mentioned /var/lib/mesos  as my
> > > work_dir.
> > >
> > >
> > > *I am explaining my setup here:*
> > >  I have a Mesos setup where master and slave both are running on the
> same
> > > network of my university campus. Mesos agent node is situated under a
> > > firewall and only port: 5000 to port:6000 are open for incoming traffic
> > > whereas Mesos master has no such restrictions. I am running master
> > service
> > > on master:5050 and agent is running on agent:5051 as default.
> > >
> > > I can see agent is communicating correctly to master and offering the
> > > available resources. I have mentioned the available ports for agents
> are
> > > ports:[5001-6000] in *src/slave/constants.cpp* file so that framework
> can
> > > communicate only through those ports which are open for my agent system
> > > behind the firewall.
> > >
> > > Now when I am launching jobs through Mesosphere marathon framework, I
> can
> > > see all jobs are connected to mesos-agent through those mentioned port
> > > ranges[5001-6000]. But my jobs are not getting submitted. So I started
> > > debugging and realised that when launching jobs mesos slaves create and
> > > launch an executor (*/erc/executor/executor.cpp*) which communicates to
> > the
> > > mesos master through a random port. Which is outside my available range
> > of
> > > 5000-6000 open ports. Now as through those ports my agent machine can
> not
> > > take any requests so executor is getting timed out and restarting the
> > > executor again and again after every 1 min of time limit.
> > >
> > > I could not find out where exactly that random port is assigned. Is
> there
> > > any socket connection that we can change to get executor connection
> > happen
> > > on desired range of ports? Please let me know if my understanding is
> > > correct and how can I change those ports for executor registration.
> > >
> > >
> > >
> > >
> > > On Wed, Aug 31, 2016 at 3:09 AM, haosdent <ha...@gmail.com> wrote:
> > >
> > > > >I0829 14:27:38.322805  2700 slave.cpp:4307] *Terminating executor
> > > > ''test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it did not
> register
> > > > within 1mins
> > > >
> > > > This log looks wired. Could you find anything in the stdout/stderr of
> > the
> > > > executor. For the executor 'test.1fb85a35-6e16-11e6-bec9-
> c27afc834a0c'
> > > > above, it should be under the folder '/tmp/mesos/slaves/d6f0e3e2-
> > > > d144-4275-9d38-82327408622b-S12/frameworks/c796100f-9ecb-
> > > > 46fa-90a2-72ad649c5dd3-0000/executors/test.1fb85a35-6e16-
> > > > 11e6-bec9-c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/'
> > > >
> > > > Apart from that, run mesos under '/tmp' is not recommended.
> > > >
> > > > On Tue, Aug 30, 2016 at 2:32 AM, Pankaj Saha <ps...@binghamton.edu>
> > > > wrote:
> > > >
> > > > > here is the log:
> > > > >
> > > > >
> > > > >
> > > > > I0829 14:24:21.727960 2679 main.cpp:223] Build: 2016-08-28
> 13:39:46
> > by
> > > > > root
> > > > > I0829 14:24:21.728159  2679 main.cpp:225] Version: 0.28.2
> > > > > I0829 14:24:21.733256  2679 containerizer.cpp:149] Using isolation:
> > > > > posix/cpu,posix/mem,filesystem/posix
> > > > > I0829 14:24:21.738895  2679 linux_launcher.cpp:101] Using
> > > > > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux
> > launcher
> > > > > I0829 14:24:21.748019  2679 main.cpp:328] Starting Mesos slave
> > > > > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
> > > > > 128.226.116.69:8082
> > > > > I0829 14:24:21.750114  2679 slave.cpp:194] Flags at startup:
> > > > > --advertise_ip="128.226.116.69" --appc_simple_discovery_uri_
> > > > > prefix="http://"
> > > > > --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5"
> > > > > --cgroups_cpu_enable_pids_and_tids_count="false"
> > > > > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> > > > > --cgroups_limit_swap="false" --cgroups_root="mesos"
> > > > > --container_disk_watch_interval="15secs" --containerizers="mesos"
> > > > > --default_role="*" --disk_watch_interval="1mins" --docker="docker"
> > > > > --docker_kill_orphans="true" --docker_registry="https://
> > > > > registry-1.docker.io"
> > > > > --docker_remove_delay="6hrs" --docker_socket="/var/run/
> docker.sock"
> > > > > --docker_stop_timeout="0ns" --docker_store_dir="/tmp/
> > > mesos/store/docker"
> > > > > --enforce_container_disk_quota="false"
> > > > > --executor_registration_timeout="1mins"
> > > > > --executor_shutdown_grace_period="5secs"
> > > > > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> > > > > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> > > > > --hadoop_home="" --help="false" --hostname_lookup="true"
> > > > > --image_provisioner_backend="copy" --initialize_driver_logging="
> > true"
> > > > > --isolation="posix/cpu,posix/mem"
> > > > > --launcher_dir="/home/pankaj/mesos-0.28.2/build/src"
> > --logbufsecs="0"
> > > > > --logging_level="INFO" --master="129.114.110.143:5050"
> > > > > --oversubscribed_resources_interval="15secs"
> > --perf_duration="10secs"
> > > > > --perf_interval="1mins" --port="8082"
> --qos_correction_interval_min=
> > > > "0ns"
> > > > > --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> > > > > --registration_backoff_factor="1secs"
> --revocable_cpu_low_priority="
> > > > true"
> > > > > --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
> > > > > --switch_user="true" --systemd_enable_support="true"
> > > > > --systemd_runtime_directory="/run/systemd/system"
> --version="false"
> > > > > --work_dir="/tmp/mesos"
> > > > > I0829 14:24:21.753572  2679 slave.cpp:464] Slave resources:
> > cpus(*):2;
> > > > > mem(*):2855; disk(*):84691; ports(*):[8081-8081]
> > > > > I0829 14:24:21.753706  2679 slave.cpp:472] Slave attributes: [  ]
> > > > > I0829 14:24:21.753762  2679 slave.cpp:477] Slave hostname:
> > > > > venom.cs.binghamton.edu
> > > > > I0829 14:24:21.770992 2696 state.cpp:58] Recovering state from
> > > > > '/tmp/mesos/meta'
> > > > > I0829 14:24:21.771304  2696 state.cpp:698] No checkpointed
> resources
> > > > found
> > > > > at '/tmp/mesos/meta/resources/resources.info'
> > > > > I0829 14:24:21.771644  2696 state.cpp:101] Failed to find the
> latest
> > > > slave
> > > > > from '/tmp/mesos/meta'
> > > > > I0829 14:24:21.772583  2696 status_update_manager.cpp:200]
> Recovering
> > > > > status update manager
> > > > > I0829 14:24:21.773082  2698 containerizer.cpp:407] Recovering
> > > > containerizer
> > > > > I0829 14:24:21.777489  2702 provisioner.cpp:245] Provisioner
> recovery
> > > > > complete
> > > > > I0829 14:24:21.778149  2699 slave.cpp:4550] Finished recovery
> > > > > I0829 14:24:21.779564  2699 slave.cpp:796] New master detected at
> > > > > master@129.114.110.143:5050
> > > > > I0829 14:24:21.779742 2697 status_update_manager.cpp:174] Pausing
> > > > sending
> > > > > status updates
> > > > > I0829 14:24:21.780607  2699 slave.cpp:821] No credentials provided.
> > > > > Attempting to register without authentication
> > > > > I0829 14:24:21.781394  2699 slave.cpp:832] Detecting new master
> > > > > I0829 14:24:22.698812  2702 slave.cpp:971] Registered with master
> > > > > master@129.114.110.143:5050; given slave ID
> > > > > d6f0e3e2-d144-4275-9d38-82327408622b-S12
> > > > > I0829 14:24:22.699113  2698 status_update_manager.cpp:181] Resuming
> > > > sending
> > > > > status updates
> > > > > I0829 14:24:22.700258  2702 slave.cpp:1030] Forwarding total
> > > > oversubscribed
> > > > > resources
> > > > > I0829 14:24:43.638958  2695 http.cpp:190] HTTP GET for
> > /slave(1)/state
> > > > from
> > > > > 128.226.119.78:59261 with User-Agent='Mozilla/5.0 (X11; Linux
> > x86_64)
> > > > > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86
> > > Safari/537.36'
> > > > > I0829 14:25:21.764268  2702 slave.cpp:4359] Current disk usage
> 9.67%.
> > > Max
> > > > > allowed age: 5.622868987169502days
> > > > > I0829 14:26:21.778849  2695 slave.cpp:4359] Current disk usage
> 9.67%.
> > > Max
> > > > > allowed age: 5.622860462326585days
> > > > > I0829 14:26:38.271085  2698 slave.cpp:1361] Got assigned task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:26:38.311063  2698 slave.cpp:1480] Launching task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:26:38.314755  2698 paths.cpp:528] Trying to chown
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > to user 'root'
> > > > > I0829 14:26:38.320300  2698 slave.cpp:5352] Launching executor
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
> > cpus(*):0.1;
> > > > > mem(*):32 in work directory
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > I0829 14:26:38.321523  2702 containerizer.cpp:666] Starting
> container
> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' for executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > I0829 14:26:38.322588  2698 slave.cpp:1698] Queuing task
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:26:38.358906  2702 linux_launcher.cpp:304] Cloning child
> > > process
> > > > > with flags =
> > > > > I0829 14:26:38.366492  2702 containerizer.cpp:1179] Checkpointing
> > > > > executor's forked pid 2758 to
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/
> > > pids/forked.pid'
> > > > > I0829 14:27:21.779755  2701 slave.cpp:4359] Current disk usage
> 9.67%.
> > > Max
> > > > > allowed age: 5.622850415190289days
> > > > > I0829 14:27:38.322805  2700 slave.cpp:4307]
> > > > > *Terminating executor ''test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > of
> > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it
> did
> > > not
> > > > > register within 1minsI0829 14:27:38.323226 2700
> > containerizer.cpp:1453]
> > > > > Destroying container 'dff399f0-beb1-4c49-bd8e-c19621de2f71'*
> > > > > I0829 14:27:38.329186  2702 cgroups.cpp:2427] Freezing cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.331509  2699 cgroups.cpp:1409] Successfully froze
> > cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > after
> > > > > 2.19392ms
> > > > > I0829 14:27:38.334520  2698 cgroups.cpp:2445] Thawing cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.337821  2698 cgroups.cpp:1438] Successfullly thawed
> > > cgroup
> > > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > after
> > > > > 3.194112ms
> > > > > I0829 14:27:38.435214  2696 containerizer.cpp:1689] Executor for
> > > > container
> > > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' has exited
> > > > > I0829 14:27:38.441556  2695 provisioner.cpp:306] Ignoring destroy
> > > request
> > > > > for unknown container dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.442186  2695 slave.cpp:3871] Executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 terminated with signal
> > > Killed
> > > > > I0829 14:27:38.445689  2695 slave.cpp:3012] Handling status update
> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 from @0.0.0.0:0
> > > > > W0829 14:27:38.447599  2702 containerizer.cpp:1295] Ignoring update
> > for
> > > > > unknown container: dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > > I0829 14:27:38.448391  2702 status_update_manager.cpp:320] Received
> > > > status
> > > > > update TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d)
> for
> > > task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.449525  2702 status_update_manager.cpp:824]
> > > Checkpointing
> > > > > UPDATE for status update TASK_FAILED (UUID:
> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.523027  2696 slave.cpp:3410] Forwarding the update
> > > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 to
> > > master@129.114.110.143:5050
> > > > > I0829 14:27:38.627722  2698 status_update_manager.cpp:392] Received
> > > > status
> > > > > update acknowledgement (UUID: 154a67ea-f6d5-4e9c-ad3d-
> 9b09161ba34d)
> > > for
> > > > > task test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.627943  2698 status_update_manager.cpp:824]
> > > Checkpointing
> > > > > ACK for status update TASK_FAILED (UUID:
> > > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.698822  2701 slave.cpp:3975] Cleaning up executor
> > > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.699582  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > for gc 6.99999190486222days in the future
> > > > > I0829 14:27:38.700202  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > > > for gc 6.99999190029037days in the future
> > > > > I0829 14:27:38.700382  2701 slave.cpp:4063] Cleaning up framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:38.700443  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > > for gc 6.99999189796148days in the future
> > > > > I0829 14:27:38.700649  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > c27afc834a0c'
> > > > > for gc 6.99999189622815days in the future
> > > > > I0829 14:27:38.700845  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > for gc 6.99999189143704days in the future
> > > > > I0829 14:27:38.701015  2701 status_update_manager.cpp:282] Closing
> > > status
> > > > > update streams for framework c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000
> > > > > I0829 14:27:38.701161  2698 gc.cpp:55] Scheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000'
> > > > > for gc 6.9999918900237days in the future
> > > > > I0829 14:27:39.651463  2697 slave.cpp:1361] Got assigned task
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.655815  2696 gc.cpp:83] Unscheduling
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0829 14:27:39.656445  2696 gc.cpp:83] Unscheduling
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > 72ad649c5dd3-0000'
> > > > > from gc
> > > > > I0829 14:27:39.656855  2702 slave.cpp:1480] Launching task
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.660585  2702 paths.cpp:528] Trying to chown
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > > to user 'root'
> > > > > I0829 14:27:39.666008  2702 slave.cpp:5352] Launching executor
> > > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
> > cpus(*):0.1;
> > > > > mem(*):32 in work directory
> > > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > > I0829 14:27:39.667603  2702 slave.cpp:1698] Queuing task
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > I0829 14:27:39.668207  2702 containerizer.cpp:666] Starting
> container
> > > > > 'ea676570-0a2a-49c3-a75c-14e045eb842b' for executor
> > > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > I0829 14:27:39.678665  2702 linux_launcher.cpp:304] Cloning child
> > > process
> > > > > with flags =
> > > > > I0829 14:27:39.681824  2702 containerizer.cpp:1179] Checkpointing
> > > > > executor's forked pid 2799 to
> > > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > 72ad649c5dd3-0000/executors/test.445696e6-6e16-11e6-bec9-
> > > > > c27afc834a0c/runs/ea676570-0a2a-49c3-a75c-14e045eb842b/
> > > pids/forked.pid'
> > > > >
> > > > > Thanks
> > > > > Pankaj
> > > > >
> > > > >
> > > > > On Mon, Aug 29, 2016 at 7:25 AM, haosdent <ha...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi @Pankaj, could you provide logs from when "the job is getting
> > > > > > restarted and a new container is created with a new process id"?
> > > > > > The logs you provided look normal.
> > > > > >
> > > > > > On Mon, Aug 29, 2016 at 5:26 AM, Pankaj Saha <
> > psaha4@binghamton.edu>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > > I am facing an issue with jobs launched on my mesos agents. I am
> > > > > > > trying to launch a job through the marathon framework, but the
> > > > > > > job stays in the staged state and never runs.
> > > > > > > I can see the following log messages on the agent console:
> > > > > > >
> > > > > > >  Scheduling
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > for gc 6.99999884239407days in the future
> > > > > > > I0828 16:20:36.053483 28512 slave.cpp:1361] *Got assigned task
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.056224 28510 gc.cpp:83] Unscheduling
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000'
> > > > > > > from gc
> > > > > > > I0828 16:20:36.056715 28510 gc.cpp:83] Unscheduling
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > from gc
> > > > > > > I0828 16:20:36.057231 28509 slave.cpp:1480] *Launching task
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.058661 28509 paths.cpp:528]* Trying to chown*
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
> > 13adb145c12d'
> > > > > > > to user 'root'
> > > > > > > I0828 16:20:36.067807 28509 slave.cpp:5352]* Launching executor
> > > > > > > test-crixus*.eb66a42b-6d5c-11e6-bec9-c27afc834a0c
> > > > > > > of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with
> > > > resources
> > > > > > > cpus(*):0.1; mem(*):32 in work directory
> > > > > > > '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-
> > > > > > > 82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-
> > > > > > > 72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-
> > > > > > > 11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-
> > 13adb145c12d'
> > > > > > > I0828 16:20:36.069314 28509 slave.cpp:1698] *Queuing task
> > > > > > > 'test-crixus.*eb66a42b-6d5c-11e6-bec9-c27afc834a0c'
> > > > > > > for executor 'test-crixus.eb66a42b-6d5c-
> 11e6-bec9-c27afc834a0c'
> > of
> > > > > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > > I0828 16:20:36.069902 28509 containerizer.cpp:666] *Starting
> > > > container*
> > > > > > > '99620406-87b5-406c-a88b-13adb145c12d' for executor
> > > > > > > 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of
> framework
> > > > > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > > I0828 16:20:36.080713 28509 linux_launcher.cpp:304] *Cloning
> > child
> > > > > > process*
> > > > > > > with flags =
> > > > > > > I0828 16:20:36.084738 28509 containerizer.cpp:1179]
> > *Checkpointing
> > > > > > > executor's forked pid 29629* to
> > > > > > > '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-
> > > > > > 9d38-82327408622b-S8/
> > > > > > > frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > > > > executors/test-crixus.eb66a42b-6d5c-11e6-bec9-
> > > > > > c27afc834a0c/runs/99620406-
> > > > > > > 87b5-406c-a88b-13adb145c12d/pids/forked.pid'
> > > > > > >
> > > > > > >
> > > > > > > But after that, the job gets restarted and a new container is
> > > > > > > created with a new process id. This happens indefinitely, which
> > > > > > > keeps the job in the staged state on the mesos-master.
> > > > > > >
> > > > > > > The job is nothing but a simple echo "hello world" kind of shell
> > > > > > > command. Can anyone please point out where it is failing or what
> > > > > > > I am doing wrong?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > > Pankaj
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > > Haosdent Huang
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards,
> > > > Haosdent Huang
> > > >
> > >
> >
>
>
>
> --
> Best Regards,
> Haosdent Huang
>

Re: jobs are stuck in agents and staying in stagged state

Posted by haosdent <ha...@gmail.com>.
If you use Linux, you can run the following command on every Mesos agent to
have ephemeral ports assigned from the 5000~6000 range:

echo "5000 6000" > /proc/sys/net/ipv4/ip_local_port_range
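As a quick check that the setting took effect, the kernel's current range can be read back from the same /proc file. A minimal sketch (assuming a Linux /proc filesystem; on an agent where the command above has been run it should report 5000-6000):

```python
# Read the kernel's ephemeral port range back from /proc and print it.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())
print(f"ephemeral range: {low}-{high}")
```

Note that the echo into /proc does not survive a reboot; the equivalent persistent setting is the sysctl key net.ipv4.ip_local_port_range.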

On Thu, Sep 1, 2016 at 3:12 PM, Vinod Kone <vi...@apache.org> wrote:

> AFAICT, your agent is listening on port 8082 and not the default 5051.
> ---------
> I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
> 128.226.116.69:8082
> ------------
>
> The fact that agent is receiving a task from the master means that the
> firewall on the agent allows incoming connections to 8082. So I'm surprised
> that a local connection from the executor to the agent is being denied.
> What exactly are your firewall rules on the agent?
>
> Also, can you share the stderr/stdout of an example executor?
>
>
> On Wed, Aug 31, 2016 at 6:18 PM, Pankaj Saha <ps...@binghamton.edu>
> wrote:
>
> > I think the executor is trying to register by communicating with the
> > mesos master, and it fails due to the network restrictions.
> > How can I change the /tmp/ path? I have set /var/lib/mesos as my
> > work_dir.
> >
> >
> > *I am explaining my setup here:*
> > I have a Mesos setup where the master and slave both run on the same
> > network of my university campus. The Mesos agent node sits behind a
> > firewall, and only ports 5000 to 6000 are open for incoming traffic,
> > whereas the Mesos master has no such restrictions. I am running the
> > master service on master:5050 and the agent on agent:5051, the default.
> >
> > I can see the agent is communicating correctly with the master and
> > offering its available resources. I have set the available agent ports
> > to ports:[5001-6000] in the *src/slave/constants.cpp* file so that
> > frameworks can communicate only through the ports that are open on my
> > agent system behind the firewall.
> >
> > Now, when I launch jobs through the Mesosphere Marathon framework, I can
> > see all jobs connect to the mesos-agent through the mentioned port range
> > [5001-6000], but my jobs are not getting submitted. So I started
> > debugging and realised that when launching jobs, mesos slaves create and
> > launch an executor (*src/executor/executor.cpp*) which communicates with
> > the mesos master through a random port, which is outside my available
> > range of 5000-6000 open ports. Since my agent machine cannot accept
> > requests through ports outside that range, the executor times out and is
> > restarted again and again after every 1-minute time limit.
> >
> > I could not find out where exactly that random port is assigned. Is
> > there any socket connection we can change so that the executor
> > connection happens on the desired range of ports? Please let me know
> > whether my understanding is correct and how I can change those ports for
> > executor registration.
> >
> >
> >
> >
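The "random port" behaviour described in the quoted message is ordinary ephemeral-port allocation: when a process binds a socket with port 0 (or connects without binding a port), the kernel picks a free port from ip_local_port_range, which is why narrowing that range, as suggested later in this thread, pulls those ports into the firewall's open window. A minimal sketch of the mechanism (not Mesos code):

```python
import socket

# Binding to port 0 asks the kernel for an ephemeral port; the value is
# drawn from /proc/sys/net/ipv4/ip_local_port_range, so narrowing that
# range constrains which ports a process can end up listening on.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
print("kernel assigned port:", s.getsockname()[1])
s.close()
```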
> > On Wed, Aug 31, 2016 at 3:09 AM, haosdent <ha...@gmail.com> wrote:
> >
> > > >I0829 14:27:38.322805  2700 slave.cpp:4307] *Terminating executor
> > > ''test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' because it did not register
> > > within 1mins
> > >
> > > This log looks weird. Could you find anything in the stdout/stderr of
> > > the executor? For the executor 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > above, it should be under the folder '/tmp/mesos/slaves/d6f0e3e2-
> > > d144-4275-9d38-82327408622b-S12/frameworks/c796100f-9ecb-
> > > 46fa-90a2-72ad649c5dd3-0000/executors/test.1fb85a35-6e16-
> > > 11e6-bec9-c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/'
> > >
> > > Apart from that, running Mesos under '/tmp' is not recommended.
> > >
> > > On Tue, Aug 30, 2016 at 2:32 AM, Pankaj Saha <ps...@binghamton.edu>
> > > wrote:
> > >
> > > > here is the log:
> > > >
> > > >
> > > >
> > > > I0829 14:24:21.727960 2679 main.cpp:223] Build: 2016-08-28 13:39:46
> by
> > > > root
> > > > I0829 14:24:21.728159  2679 main.cpp:225] Version: 0.28.2
> > > > I0829 14:24:21.733256  2679 containerizer.cpp:149] Using isolation:
> > > > posix/cpu,posix/mem,filesystem/posix
> > > > I0829 14:24:21.738895  2679 linux_launcher.cpp:101] Using
> > > > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux
> launcher
> > > > I0829 14:24:21.748019  2679 main.cpp:328] Starting Mesos slave
> > > > I0829 14:24:21.750063  2679 slave.cpp:193] Slave started on 1)@
> > > > 128.226.116.69:8082
> > > > I0829 14:24:21.750114  2679 slave.cpp:194] Flags at startup:
> > > > --advertise_ip="128.226.116.69" --appc_simple_discovery_uri_
> > > > prefix="http://"
> > > > --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5"
> > > > --cgroups_cpu_enable_pids_and_tids_count="false"
> > > > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> > > > --cgroups_limit_swap="false" --cgroups_root="mesos"
> > > > --container_disk_watch_interval="15secs" --containerizers="mesos"
> > > > --default_role="*" --disk_watch_interval="1mins" --docker="docker"
> > > > --docker_kill_orphans="true" --docker_registry="https://
> > > > registry-1.docker.io"
> > > > --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock"
> > > > --docker_stop_timeout="0ns" --docker_store_dir="/tmp/
> > mesos/store/docker"
> > > > --enforce_container_disk_quota="false"
> > > > --executor_registration_timeout="1mins"
> > > > --executor_shutdown_grace_period="5secs"
> > > > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> > > > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> > > > --hadoop_home="" --help="false" --hostname_lookup="true"
> > > > --image_provisioner_backend="copy" --initialize_driver_logging="
> true"
> > > > --isolation="posix/cpu,posix/mem"
> > > > --launcher_dir="/home/pankaj/mesos-0.28.2/build/src"
> --logbufsecs="0"
> > > > --logging_level="INFO" --master="129.114.110.143:5050"
> > > > --oversubscribed_resources_interval="15secs"
> --perf_duration="10secs"
> > > > --perf_interval="1mins" --port="8082" --qos_correction_interval_min=
> > > "0ns"
> > > > --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> > > > --registration_backoff_factor="1secs" --revocable_cpu_low_priority="
> > > true"
> > > > --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
> > > > --switch_user="true" --systemd_enable_support="true"
> > > > --systemd_runtime_directory="/run/systemd/system" --version="false"
> > > > --work_dir="/tmp/mesos"
> > > > I0829 14:24:21.753572  2679 slave.cpp:464] Slave resources:
> cpus(*):2;
> > > > mem(*):2855; disk(*):84691; ports(*):[8081-8081]
> > > > I0829 14:24:21.753706  2679 slave.cpp:472] Slave attributes: [  ]
> > > > I0829 14:24:21.753762  2679 slave.cpp:477] Slave hostname:
> > > > venom.cs.binghamton.edu
> > > > I0829 14:24:21.770992 2696 state.cpp:58] Recovering state from
> > > > '/tmp/mesos/meta'
> > > > I0829 14:24:21.771304  2696 state.cpp:698] No checkpointed resources
> > > found
> > > > at '/tmp/mesos/meta/resources/resources.info'
> > > > I0829 14:24:21.771644  2696 state.cpp:101] Failed to find the latest
> > > slave
> > > > from '/tmp/mesos/meta'
> > > > I0829 14:24:21.772583  2696 status_update_manager.cpp:200] Recovering
> > > > status update manager
> > > > I0829 14:24:21.773082  2698 containerizer.cpp:407] Recovering
> > > containerizer
> > > > I0829 14:24:21.777489  2702 provisioner.cpp:245] Provisioner recovery
> > > > complete
> > > > I0829 14:24:21.778149  2699 slave.cpp:4550] Finished recovery
> > > > I0829 14:24:21.779564  2699 slave.cpp:796] New master detected at
> > > > master@129.114.110.143:5050
> > > > I0829 14:24:21.779742 2697 status_update_manager.cpp:174] Pausing
> > > sending
> > > > status updates
> > > > I0829 14:24:21.780607  2699 slave.cpp:821] No credentials provided.
> > > > Attempting to register without authentication
> > > > I0829 14:24:21.781394  2699 slave.cpp:832] Detecting new master
> > > > I0829 14:24:22.698812  2702 slave.cpp:971] Registered with master
> > > > master@129.114.110.143:5050; given slave ID
> > > > d6f0e3e2-d144-4275-9d38-82327408622b-S12
> > > > I0829 14:24:22.699113  2698 status_update_manager.cpp:181] Resuming
> > > sending
> > > > status updates
> > > > I0829 14:24:22.700258  2702 slave.cpp:1030] Forwarding total
> > > oversubscribed
> > > > resources
> > > > I0829 14:24:43.638958  2695 http.cpp:190] HTTP GET for
> /slave(1)/state
> > > from
> > > > 128.226.119.78:59261 with User-Agent='Mozilla/5.0 (X11; Linux
> x86_64)
> > > > AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86
> > Safari/537.36'
> > > > I0829 14:25:21.764268  2702 slave.cpp:4359] Current disk usage 9.67%.
> > Max
> > > > allowed age: 5.622868987169502days
> > > > I0829 14:26:21.778849  2695 slave.cpp:4359] Current disk usage 9.67%.
> > Max
> > > > allowed age: 5.622860462326585days
> > > > I0829 14:26:38.271085  2698 slave.cpp:1361] Got assigned task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:26:38.311063  2698 slave.cpp:1480] Launching task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c for framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:26:38.314755  2698 paths.cpp:528] Trying to chown
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > to user 'root'
> > > > I0829 14:26:38.320300  2698 slave.cpp:5352] Launching executor
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
> cpus(*):0.1;
> > > > mem(*):32 in work directory
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > I0829 14:26:38.321523  2702 containerizer.cpp:666] Starting container
> > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' for executor
> > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > I0829 14:26:38.322588  2698 slave.cpp:1698] Queuing task
> > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:26:38.358906  2702 linux_launcher.cpp:304] Cloning child
> > process
> > > > with flags =
> > > > I0829 14:26:38.366492  2702 containerizer.cpp:1179] Checkpointing
> > > > executor's forked pid 2758 to
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71/
> > pids/forked.pid'
> > > > I0829 14:27:21.779755  2701 slave.cpp:4359] Current disk usage 9.67%.
> > Max
> > > > allowed age: 5.622850415190289days
> > > > I0829 14:27:38.322805  2700 slave.cpp:4307]
> > > > *Terminating executor 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of
> > > > framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 because it did
> > > > not register within 1mins*
> > > > *I0829 14:27:38.323226  2700 containerizer.cpp:1453]
> > > > Destroying container 'dff399f0-beb1-4c49-bd8e-c19621de2f71'*
> > > > I0829 14:27:38.329186  2702 cgroups.cpp:2427] Freezing cgroup
> > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > I0829 14:27:38.331509  2699 cgroups.cpp:1409] Successfully froze
> cgroup
> > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > after
> > > > 2.19392ms
> > > > I0829 14:27:38.334520  2698 cgroups.cpp:2445] Thawing cgroup
> > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > I0829 14:27:38.337821  2698 cgroups.cpp:1438] Successfullly thawed
> > cgroup
> > > > /sys/fs/cgroup/freezer/mesos/dff399f0-beb1-4c49-bd8e-c19621de2f71
> > after
> > > > 3.194112ms
> > > > I0829 14:27:38.435214  2696 containerizer.cpp:1689] Executor for
> > > container
> > > > 'dff399f0-beb1-4c49-bd8e-c19621de2f71' has exited
> > > > I0829 14:27:38.441556  2695 provisioner.cpp:306] Ignoring destroy
> > request
> > > > for unknown container dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > I0829 14:27:38.442186  2695 slave.cpp:3871] Executor
> > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 terminated with signal
> > Killed
> > > > I0829 14:27:38.445689  2695 slave.cpp:3012] Handling status update
> > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 from @0.0.0.0:0
> > > > W0829 14:27:38.447599  2702 containerizer.cpp:1295] Ignoring update
> for
> > > > unknown container: dff399f0-beb1-4c49-bd8e-c19621de2f71
> > > > I0829 14:27:38.448391  2702 status_update_manager.cpp:320] Received
> > > status
> > > > update TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for
> > task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.449525  2702 status_update_manager.cpp:824]
> > Checkpointing
> > > > UPDATE for status update TASK_FAILED (UUID:
> > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.523027  2696 slave.cpp:3410] Forwarding the update
> > > > TASK_FAILED (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 to
> > master@129.114.110.143:5050
> > > > I0829 14:27:38.627722  2698 status_update_manager.cpp:392] Received
> > > status
> > > > update acknowledgement (UUID: 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d)
> > for
> > > > task test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.627943  2698 status_update_manager.cpp:824]
> > Checkpointing
> > > > ACK for status update TASK_FAILED (UUID:
> > > > 154a67ea-f6d5-4e9c-ad3d-9b09161ba34d) for task
> > > > test.1fb85a35-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.698822  2701 slave.cpp:3975] Cleaning up executor
> > > > 'test.1fb85a35-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.699582  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > for gc 6.99999190486222days in the future
> > > > I0829 14:27:38.700202  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.1fb85a35-6e16-11e6-bec9-c27afc834a0c'
> > > > for gc 6.99999190029037days in the future
> > > > I0829 14:27:38.700382  2701 slave.cpp:4063] Cleaning up framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:38.700443  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> > > > c27afc834a0c/runs/dff399f0-beb1-4c49-bd8e-c19621de2f71'
> > > > for gc 6.99999189796148days in the future
> > > > I0829 14:27:38.700649  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000/executors/test.1fb85a35-6e16-11e6-bec9-
> c27afc834a0c'
> > > > for gc 6.99999189622815days in the future
> > > > I0829 14:27:38.700845  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > for gc 6.99999189143704days in the future
> > > > I0829 14:27:38.701015  2701 status_update_manager.cpp:282] Closing
> > status
> > > > update streams for framework c796100f-9ecb-46fa-90a2-
> 72ad649c5dd3-0000
> > > > I0829 14:27:38.701161  2698 gc.cpp:55] Scheduling
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> 72ad649c5dd3-0000'
> > > > for gc 6.9999918900237days in the future
> > > > I0829 14:27:39.651463  2697 slave.cpp:1361] Got assigned task
> > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:39.655815  2696 gc.cpp:83] Unscheduling
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > from gc
> > > > I0829 14:27:39.656445  2696 gc.cpp:83] Unscheduling
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> 72ad649c5dd3-0000'
> > > > from gc
> > > > I0829 14:27:39.656855  2702 slave.cpp:1480] Launching task
> > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c for framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:39.660585  2702 paths.cpp:528] Trying to chown
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > to user 'root'
> > > > I0829 14:27:39.666008  2702 slave.cpp:5352] Launching executor
> > > > test.445696e6-6e16-11e6-bec9-c27afc834a0c of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources
> cpus(*):0.1;
> > > > mem(*):32 in work directory
> > > > '/tmp/mesos/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-
> > > > S12/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/
> > > > executors/test.445696e6-6e16-11e6-bec9-c27afc834a0c/runs/
> > > > ea676570-0a2a-49c3-a75c-14e045eb842b'
> > > > I0829 14:27:39.667603  2702 slave.cpp:1698] Queuing task
> > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' for executor
> > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > I0829 14:27:39.668207  2702 containerizer.cpp:666] Starting container
> > > > 'ea676570-0a2a-49c3-a75c-14e045eb842b' for executor
> > > > 'test.445696e6-6e16-11e6-bec9-c27afc834a0c' of framework
> > > > 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > I0829 14:27:39.678665  2702 linux_launcher.cpp:304] Cloning child
> > process
> > > > with flags =
> > > > I0829 14:27:39.681824  2702 containerizer.cpp:1179] Checkpointing
> > > > executor's forked pid 2799 to
> > > > '/tmp/mesos/meta/slaves/d6f0e3e2-d144-4275-9d38-
> > > > 82327408622b-S12/frameworks/c796100f-9ecb-46fa-90a2-
> > > > 72ad649c5dd3-0000/executors/test.445696e6-6e16-11e6-bec9-
> > > > c27afc834a0c/runs/ea676570-0a2a-49c3-a75c-14e045eb842b/
> > pids/forked.pid'
> > > >
> > > > Thanks
> > > > Pankaj
> > > >
> > > >
> > > > On Mon, Aug 29, 2016 at 7:25 AM, haosdent <ha...@gmail.com>
> wrote:
> > > >
> > > > > Hi @Pankaj, could you provide logs from when "the job is getting
> > > > > restarted and a new container is created with a new process id"?
> > > > > The logs you provided look normal.
> > > > >
> > > > > On Mon, Aug 29, 2016 at 5:26 AM, Pankaj Saha <
> psaha4@binghamton.edu>
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > > I am facing an issue with jobs launched on my mesos agents. I am
> > > > > > trying to launch a job through the marathon framework, but the
> > > > > > job stays in the staged state and never runs.
> > > > > > I can see the following log messages on the agent console:
> > > > > >
> > > > > > Scheduling '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' for gc 6.99999884239407days in the future
> > > > > > I0828 16:20:36.053483 28512 slave.cpp:1361] Got assigned task test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > I0828 16:20:36.056224 28510 gc.cpp:83] Unscheduling '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' from gc
> > > > > > I0828 16:20:36.056715 28510 gc.cpp:83] Unscheduling '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000' from gc
> > > > > > I0828 16:20:36.057231 28509 slave.cpp:1480] Launching task test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c for framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > I0828 16:20:36.058661 28509 paths.cpp:528] Trying to chown '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-13adb145c12d' to user 'root'
> > > > > > I0828 16:20:36.067807 28509 slave.cpp:5352] Launching executor test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos-8082/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-13adb145c12d'
> > > > > > I0828 16:20:36.069314 28509 slave.cpp:1698] Queuing task 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' for executor 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of framework c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000
> > > > > > I0828 16:20:36.069902 28509 containerizer.cpp:666] Starting container '99620406-87b5-406c-a88b-13adb145c12d' for executor 'test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c' of framework 'c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000'
> > > > > > I0828 16:20:36.080713 28509 linux_launcher.cpp:304] Cloning child process with flags =
> > > > > > I0828 16:20:36.084738 28509 containerizer.cpp:1179] Checkpointing executor's forked pid 29629 to '/var/lib/mesos-8082/meta/slaves/d6f0e3e2-d144-4275-9d38-82327408622b-S8/frameworks/c796100f-9ecb-46fa-90a2-72ad649c5dd3-0000/executors/test-crixus.eb66a42b-6d5c-11e6-bec9-c27afc834a0c/runs/99620406-87b5-406c-a88b-13adb145c12d/pids/forked.pid'
> > > > > >
> > > > > >
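The `pids/forked.pid` checkpoint in the last log line is a plain-text pid, so it is easy to probe whether the forked executor is still alive while the restart loop runs. A minimal sketch, not Mesos code; the function name is illustrative, and the path argument would be the checkpoint path from the log above:

```python
import errno
import os

def executor_alive(forked_pid_path):
    """Check whether the pid checkpointed by the containerizer
    (the pids/forked.pid file) still refers to a live process."""
    with open(forked_pid_path) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)              # signal 0: existence probe, sends nothing
        return True
    except OSError as e:
        if e.errno == errno.ESRCH:   # no such process: the executor died
            return False
        if e.errno == errno.EPERM:   # exists, but owned by another user
            return True
        raise
```

If this flips to False within seconds of the "Checkpointing executor's forked pid" line, the executor process itself is dying before it can register with the agent.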
> > > > > > But after that, the task gets restarted and a new container is
> > > > > > created with a new process id. This happens indefinitely, which
> > > > > > keeps the task in the STAGING state on the Mesos master.
> > > > > >
> > > > > > The job is nothing but a simple echo "hello world" style shell
> > > > > > command. Can anyone please point out where it is failing, or
> > > > > > what I am doing wrong?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Pankaj
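When an executor restarts in a loop like this, the stderr of the newest run sandbox usually names the step that fails (for example, a refused connection back to the agent). A minimal sketch for locating it under the agent work directory, assuming the standard `slaves/<id>/frameworks/<id>/executors/<id>/runs/<id>` layout seen in the log above; the helper names are illustrative:

```python
import glob
import os

def latest_run_dir(work_dir):
    """Find the most recently modified run sandbox under a Mesos
    agent work dir (slaves/<agent>/frameworks/<fw>/executors/<exec>/runs/<run>)."""
    pattern = os.path.join(work_dir, 'slaves', '*', 'frameworks', '*',
                           'executors', '*', 'runs', '*')
    runs = [r for r in glob.glob(pattern) if os.path.isdir(r)]
    return max(runs, key=os.path.getmtime) if runs else None

def executor_stderr(work_dir):
    """Return the stderr of the newest executor run, or None if absent."""
    run = latest_run_dir(work_dir)
    if run is None:
        return None
    path = os.path.join(run, 'stderr')
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()
```

Running `executor_stderr('/var/lib/mesos-8082')` on the agent between two restarts should capture the failing run's error output before the sandbox is replaced by the next attempt.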
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > > Haosdent Huang
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>



-- 
Best Regards,
Haosdent Huang