Posted to issues@mesos.apache.org by "Ilya (Jira)" <ji...@apache.org> on 2021/03/25 21:17:00 UTC

[jira] [Assigned] (MESOS-1648) Add a --pidfile option to master and agent binaries.

     [ https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya reassigned MESOS-1648:
---------------------------

    Assignee:     (was: Ilya)

> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
>                 Key: MESOS-1648
>                 URL: https://issues.apache.org/jira/browse/MESOS-1648
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Tobias Weingartner
>            Priority: Major
>              Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try to maintain a {{/var/run/mesos/mesos-slave.pid}} file so that we can monitor the process. This has proven to be somewhat fragile due to the lack of locking and the possibility of races and stale data.
> By adding a {{--pidfile}} option, we can obtain a lock on the file to prevent multiple binaries from starting, and enable the tooling to validate that the lock is held before doing any signaling. We can also do a best-effort unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
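> fd = open(pidfile, O_CREAT ...)
> // LOCK_NB makes flock fail right away instead of blocking when another process holds the lock.
> if flock(fd, LOCK_EX | LOCK_NB) fails, abort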
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
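
For illustration only, the open / flock / ftruncate / write sequence above could be sketched in Python as follows. This is not the Mesos implementation: the function names `acquire_pidfile` and `release_pidfile` are hypothetical, the file mode `0o644` is an assumption, and the non-blocking lock flag is one way to get the "if not locked, abort" behaviour.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Open, lock, truncate and write our pid. Returns the open descriptor.

    Hypothetical helper: the caller must keep the descriptor open for the
    lifetime of the process, because closing it releases the lock.
    """
    # Assumed flags and mode; the original pseudocode elides them.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive, non-blocking: fail at once if another instance holds it.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(f"{path} is locked by another process")
    # Only truncate after the lock is held, so we never clobber a live pid.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd


def release_pidfile(fd, path):
    """Best-effort cleanup, for example when called from a SIGTERM handler."""
    try:
        os.unlink(path)
    except OSError:
        pass  # best effort: the file may already be gone
    os.close(fd)  # closing the descriptor drops the flock
```

If the process dies without running the cleanup, the file stays behind but the kernel drops the lock, so tooling that checks the lock rather than the file's existence would still see that nothing is running.
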
> Digging around, looks like the open, ftruncate, write pattern is pretty common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
> The tooling around it could validate that the file is locked by the pid inside it before taking any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is running. It's possible for something to be running and about to grab the lock, but we'll eventually read it correctly and converge on a single instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is running but not as the pid in the file (.. yet). Treat this the same as (1), assume it's not running, and the next attempts to start will eventually converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then assume it is running as that pid. Note that it's still possible that in between matching the pid and taking an action (e.g. kill), the pid may become stale, but the recycling pattern of pids makes it unlikely to be re-used unless there is a large delay.
> It seems like some tools already do this signal wrapping (note the comment about fcntl and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html
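
A rough sketch of how tooling might apply the three cases above, written in Python for illustration. The names `is_locked` and `target_pid` are hypothetical. Finding out which process holds the lock is platform-specific (the fcntl-style locks mentioned in the ovs-kill page can report the holder, flock cannot do so directly), so this sketch assumes the caller obtains that pid by other means and passes it in as `locker_pid`.

```python
import fcntl
import os
from typing import Optional


def is_locked(path: str) -> bool:
    """Return True if some other open file description holds a flock on path."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False
    try:
        # Probe by trying to take the lock without blocking. Note that a
        # successful probe briefly holds the lock until the close below.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    finally:
        os.close(fd)
    return False


def target_pid(path: str, locker_pid: Optional[int]) -> Optional[int]:
    """Return the pid that is safe to signal, or None if nothing should be signaled.

    locker_pid is an assumption of this sketch: the pid of the lock holder,
    obtained by some platform-specific mechanism outside this function.
    """
    if not is_locked(path):
        return None  # Case 1: missing or unlocked, assume nothing is running.
    try:
        with open(path) as f:
            pid_in_file = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None  # Pid not written yet: treat like Case 2.
    if pid_in_file != locker_pid:
        return None  # Case 2: locked, but not by the pid in the file.
    return pid_in_file  # Case 3: locked by the pid recorded in the file.
```

A caller would signal only when `target_pid` returns a pid, and should do so promptly, since the race described in Case 3 between the check and the signal still applies.
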



--
This message was sent by Atlassian Jira
(v8.3.4#803005)