You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Ilya Pronin <ip...@twopensource.com> on 2021/01/12 01:23:45 UTC

Review Request 73131: Fixed agent reregistration and marking as unreachable race.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/
-----------------------------------------------------------

Review request for mesos and Benjamin Mahler.


Bugs: MESOS-10209
    https://issues.apache.org/jira/browse/MESOS-10209


Repository: mesos


Description
-------

During master failover if agent reregistration runs concurrently with
marking the agent as unreachable and finishes before the MarkUnreachable
operation is complete, the assertion that the agent is in the recovered
set in Master::_markUnreachable() doesn't hold. The reason for this is
because after readmitting the agent the master removes it from the
recovered set in Master::__reregisterSlave().

We can fix this by ignoring agent reregistration requests while a
marking unreachable operation is in progress, similarly to how we do it
for marking gone. Once the marking operation is complete, the agent will
be able to reregister as usual.


Diffs
-----

  src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
  src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 


Diff: https://reviews.apache.org/r/73131/diff/1/


Testing
-------

Ran `make check`. Verified that the new test crashes without the fix.


Thanks,

Ilya Pronin


Re: Review Request 73131: Fixed agent reregistration and marking as unreachable race.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/#review222438
-----------------------------------------------------------



Patch looks great!

Reviews applied: [73131]

Passed command: export OS='ubuntu:16.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/jenkins/buildbot.sh

- Mesos Reviewbot


On Jan. 12, 2021, 1:23 a.m., Ilya Pronin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73131/
> -----------------------------------------------------------
> 
> (Updated Jan. 12, 2021, 1:23 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-10209
>     https://issues.apache.org/jira/browse/MESOS-10209
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During master failover if agent reregistration runs concurrently with
> marking the agent as unreachable and finishes before the MarkUnreachable
> operation is complete, the assertion that the agent is in the recovered
> set in Master::_markUnreachable() doesn't hold. The reason for this is
> because after readmitting the agent the master removes it from the
> recovered set in Master::__reregisterSlave().
> 
> We can fix this by ignoring agent reregistration requests while a
> marking unreachable operation is in progress, similarly to how we do it
> for marking gone. Once the marking operation is complete, the agent will
> be able to reregister as usual.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
>   src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 
> 
> 
> Diff: https://reviews.apache.org/r/73131/diff/1/
> 
> 
> Testing
> -------
> 
> Ran `make check`. Verified that the new test crashes without the fix.
> 
> 
> Thanks,
> 
> Ilya Pronin
> 
>


Re: Review Request 73131: Fixed agent reregistration and marking as unreachable race.

Posted by Ilya Pronin <ip...@twopensource.com>.

> On Jan. 12, 2021, 12:17 p.m., Benjamin Mahler wrote:
> > src/tests/master_tests.cpp
> > Lines 11235-11239 (patched)
> > <https://reviews.apache.org/r/73131/diff/1/?file=2244100#file2244100line11235>
> >
> >     Maybe a TODO that we can use the in-memory registry here if we made it injectable in StartMaster?
> >     
> >     (The benefit being that the tests run faster with the in-memory one).

Added.


> On Jan. 12, 2021, 12:17 p.m., Benjamin Mahler wrote:
> > src/tests/master_tests.cpp
> > Lines 11247 (patched)
> > <https://reviews.apache.org/r/73131/diff/1/?file=2244100#file2244100line11247>
> >
> >     We can just avoid this variable and passing it in to StartSlave since we're using the default flags?

Inlined.


> On Jan. 12, 2021, 12:17 p.m., Benjamin Mahler wrote:
> > src/tests/master_tests.cpp
> > Lines 11268 (patched)
> > <https://reviews.apache.org/r/73131/diff/1/?file=2244100#file2244100line11268>
> >
> >     Don't need to settle here since we're just waiting for mark after.

Fixed.


- Ilya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/#review222447
-----------------------------------------------------------


On Jan. 11, 2021, 5:23 p.m., Ilya Pronin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73131/
> -----------------------------------------------------------
> 
> (Updated Jan. 11, 2021, 5:23 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-10209
>     https://issues.apache.org/jira/browse/MESOS-10209
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During master failover if agent reregistration runs concurrently with
> marking the agent as unreachable and finishes before the MarkUnreachable
> operation is complete, the assertion that the agent is in the recovered
> set in Master::_markUnreachable() doesn't hold. The reason for this is
> because after readmitting the agent the master removes it from the
> recovered set in Master::__reregisterSlave().
> 
> We can fix this by ignoring agent reregistration requests while a
> marking unreachable operation is in progress, similarly to how we do it
> for marking gone. Once the marking operation is complete, the agent will
> be able to reregister as usual.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
>   src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 
> 
> 
> Diff: https://reviews.apache.org/r/73131/diff/1/
> 
> 
> Testing
> -------
> 
> Ran `make check`. Verified that the new test crashes without the fix.
> 
> 
> Thanks,
> 
> Ilya Pronin
> 
>


Re: Review Request 73131: Fixed agent reregistration and marking as unreachable race.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/#review222447
-----------------------------------------------------------


Ship it!





src/tests/master_tests.cpp
Lines 11235-11239 (patched)
<https://reviews.apache.org/r/73131/#comment311501>

    Maybe a TODO that we can use the in-memory registry here if we made it injectable in StartMaster?
    
    (The benefit being that the tests run faster with the in-memory one).



src/tests/master_tests.cpp
Lines 11247 (patched)
<https://reviews.apache.org/r/73131/#comment311502>

    We can just avoid this variable and passing it in to StartSlave since we're using the default flags?



src/tests/master_tests.cpp
Lines 11268 (patched)
<https://reviews.apache.org/r/73131/#comment311503>

    Don't need to settle here since we're just waiting for mark after.


- Benjamin Mahler


On Jan. 12, 2021, 1:23 a.m., Ilya Pronin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/73131/
> -----------------------------------------------------------
> 
> (Updated Jan. 12, 2021, 1:23 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-10209
>     https://issues.apache.org/jira/browse/MESOS-10209
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During master failover if agent reregistration runs concurrently with
> marking the agent as unreachable and finishes before the MarkUnreachable
> operation is complete, the assertion that the agent is in the recovered
> set in Master::_markUnreachable() doesn't hold. The reason for this is
> because after readmitting the agent the master removes it from the
> recovered set in Master::__reregisterSlave().
> 
> We can fix this by ignoring agent reregistration requests while a
> marking unreachable operation is in progress, similarly to how we do it
> for marking gone. Once the marking operation is complete, the agent will
> be able to reregister as usual.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
>   src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 
> 
> 
> Diff: https://reviews.apache.org/r/73131/diff/1/
> 
> 
> Testing
> -------
> 
> Ran `make check`. Verified that the new test crashes without the fix.
> 
> 
> Thanks,
> 
> Ilya Pronin
> 
>


Re: Review Request 73131: Fixed agent reregistration and marking as unreachable race.

Posted by Ilya Pronin <ip...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/
-----------------------------------------------------------

(Updated Jan. 12, 2021, 1:41 p.m.)


Review request for mesos and Benjamin Mahler.


Changes
-------

Simplified.


Bugs: MESOS-10209
    https://issues.apache.org/jira/browse/MESOS-10209


Repository: mesos


Description
-------

During master failover if agent reregistration runs concurrently with
marking the agent as unreachable and finishes before the MarkUnreachable
operation is complete, the assertion that the agent is in the recovered
set in Master::_markUnreachable() doesn't hold. The reason for this is
because after readmitting the agent the master removes it from the
recovered set in Master::__reregisterSlave().

We can fix this by ignoring agent reregistration requests while a
marking unreachable operation is in progress, similarly to how we do it
for marking gone. Once the marking operation is complete, the agent will
be able to reregister as usual.


Diffs (updated)
-----

  src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
  src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 


Diff: https://reviews.apache.org/r/73131/diff/3/

Changes: https://reviews.apache.org/r/73131/diff/2-3/


Testing
-------

Ran `make check`. Verified that the new test crashes without the fix.


Thanks,

Ilya Pronin


Re: Review Request 73131: Fixed agent reregistration and marking as unreachable race.

Posted by Ilya Pronin <ip...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/
-----------------------------------------------------------

(Updated Jan. 12, 2021, 1 p.m.)


Review request for mesos and Benjamin Mahler.


Changes
-------

Addressed review comments.


Bugs: MESOS-10209
    https://issues.apache.org/jira/browse/MESOS-10209


Repository: mesos


Description
-------

During master failover if agent reregistration runs concurrently with
marking the agent as unreachable and finishes before the MarkUnreachable
operation is complete, the assertion that the agent is in the recovered
set in Master::_markUnreachable() doesn't hold. The reason for this is
because after readmitting the agent the master removes it from the
recovered set in Master::__reregisterSlave().

We can fix this by ignoring agent reregistration requests while a
marking unreachable operation is in progress, similarly to how we do it
for marking gone. Once the marking operation is complete, the agent will
be able to reregister as usual.


Diffs (updated)
-----

  src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e 
  src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47 


Diff: https://reviews.apache.org/r/73131/diff/2/

Changes: https://reviews.apache.org/r/73131/diff/1-2/


Testing
-------

Ran `make check`. Verified that the new test crashes without the fix.


Thanks,

Ilya Pronin