Posted to reviews@mesos.apache.org by Armand Grillet <ag...@mesosphere.io> on 2017/12/06 15:57:58 UTC

Review Request 64379: Improved logs displayed after a slave failed recovery.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/
-----------------------------------------------------------

Review request for mesos and Alexander Rukletsov.


Repository: mesos


Description
-------

Add some steps to clean the Docker daemon
state used by the Docker containerizer.


Diffs
-----

  src/slave/slave.cpp 49270013537356c8fe9150d757b064bc3bbae3cb 


Diff: https://reviews.apache.org/r/64379/diff/1/


Testing
-------

None. The previous logs were:
```
Nov: To remedy this do as follows:
Nov: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
Nov: This ensures agent doesn't recover old live executors.
Nov: Step 2: Restart the agent.
```
I have therefore removed the tab before `This ensures agent doesn't recover`, since it did not appear in the logs.
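The `Nov:` prefix on every line of the previous logs suggests that whatever collected the agent output (assumed here to be a syslog-style collector; the thread does not say which) prepends a header to each line of a multi-line message. The C++ sketch below uses a hypothetical `prefixEachLine` helper, which is not part of Mesos, to preview a message under that assumption. It is one way to check the console layout of a multi-line log message before pasting it into the testing section.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Mesos): mimics a log collector that
// prepends `prefix` to every line of a multi-line message, which is an
// assumption about how the "Nov:" lines in the previous logs were produced.
std::string prefixEachLine(const std::string& prefix, const std::string& message)
{
  assert(!prefix.empty());

  std::istringstream in(message);
  std::ostringstream out;
  std::string line;

  // std::getline splits the message on '\n'; every resulting line is
  // written back out with the prefix and a single space in front of it.
  while (std::getline(in, line)) {
    out << prefix << " " << line << "\n";
  }

  return out.str();
}
```

For example, `prefixEachLine("Nov:", "Step 1: rm -f x\nStep 2: Restart the agent.")` yields one `Nov:`-prefixed line per input line, matching the layout of the previous logs.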


Thanks,

Armand Grillet

Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/#review193278
-----------------------------------------------------------



FAIL: Some Mesos libprocess-tests failed.

Reviews applied: `['64379']`

Failed command: `C:\DCOS\mesos\3rdparty\libprocess\src\tests\Debug\libprocess-tests.exe`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64379

Relevant logs:

- [libprocess-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64379/logs/libprocess-tests-stdout.log):

```
[       OK ] LimiterTest.THREADSAFE_Acquire (3 ms)
[ RUN      ] LimiterTest.THREADSAFE_DiscardMiddle
[       OK ] LimiterTest.THREADSAFE_DiscardMiddle (3 ms)
[ RUN      ] LimiterTest.THREADSAFE_DiscardLast
[       OK ] LimiterTest.THREADSAFE_DiscardLast (2 ms)
[----------] 3 tests from LimiterTest (101 ms total)

[----------] 4 tests from LoopTest
[ RUN      ] LoopTest.Sync
[       OK ] LoopTest.Sync (1 ms)
[ RUN      ] LoopTest.Async
[       OK ] LoopTest.Async (1 ms)
[ RUN      ] LoopTest.DiscardIterate
[       OK ] LoopTest.DiscardIterate (1 ms)
[ RUN      ] LoopTest.DiscardBody
[       OK ] LoopTest.DiscardBody (2 ms)
[----------] 4 tests from LoopTest (108 ms total)

[----------] 9 tests from MetricsTest
[ RUN      ] MetricsTest.Counter
[       OK ] MetricsTest.Counter (3 ms)
[ RUN      ] MetricsTest.THREADSAFE_Gauge
[       OK ] MetricsTest.THREADSAFE_Gauge (2 ms)
[ RUN      ] MetricsTest.Statistics
[       OK ] MetricsTest.Statistics (7 ms)
[ RUN      ] MetricsTest.THREADSAFE_Snapshot
[       OK ] MetricsTest.THREADSAFE_Snapshot (53 ms)
[ RUN      ] MetricsTest.THREADSAFE_SnapshotTimeout
C:\DCOS\mesos\mesos\3rdparty\libprocess\src\tests\metrics_tests.cpp(335): error: Failed to wait 15secs for response
```

- Mesos Reviewbot Windows


On Dec. 8, 2017, 1:23 p.m., Armand Grillet wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64379/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2017, 1:23 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov and Benno Evers.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Add some steps to clean the Docker daemon
> state used by the Docker containerizer.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 54d8bcc035227dd6896ffa6e692a91749c0b56a6 
> 
> 
> Diff: https://reviews.apache.org/r/64379/diff/2/
> 
> 
> Testing
> -------
> 
> None. The previous logs were:
> ```
> Nov: To remedy this do as follows:
> Nov: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Nov: This ensures agent doesn't recover old live executors.
> Nov: Step 2: Restart the agent.
> ```
> I have therefore removed the tab before `This ensures agent doesn't recover`, since it did not appear in the logs.
> 
> 
> Thanks,
> 
> Armand Grillet
> 
>

Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Alexander Rukletsov <ru...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/#review193266
-----------------------------------------------------------



Please check how the output actually looks in the console and paste it into the Testing Done section.


src/slave/slave.cpp
Lines 6795 (patched)
<https://reviews.apache.org/r/64379/#comment271806>

    ..., not just those started by Mesos!


- Alexander Rukletsov



Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/#review193781
-----------------------------------------------------------



FAIL: Some Mesos tests failed.

Reviews applied: `['64379']`

Failed command: `D:\DCOS\mesos\src\mesos-tests.exe --verbose`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64379

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64379/logs/mesos-tests-stdout.log):

```
[ RUN      ] SlaveTest.RunTaskGroupFailedSecretGeneration
[       OK ] SlaveTest.RunTaskGroupFailedSecretGeneration (232 ms)
[ RUN      ] SlaveTest.RunTaskGroupInvalidExecutorSecret
[       OK ] SlaveTest.RunTaskGroupInvalidExecutorSecret (240 ms)
[ RUN      ] SlaveTest.RunTaskGroupReferenceTypeSecret
[       OK ] SlaveTest.RunTaskGroupReferenceTypeSecret (238 ms)
[ RUN      ] SlaveTest.RunTaskGroupGenerateSecretAfterShutdown
[       OK ] SlaveTest.RunTaskGroupGenerateSecretAfterShutdown (251 ms)
[ RUN      ] SlaveTest.KillTaskGroupBetweenRunTaskParts
[       OK ] SlaveTest.KillTaskGroupBetweenRunTaskParts (223 ms)
[ RUN      ] SlaveTest.KillQueuedTaskGroup
[       OK ] SlaveTest.KillQueuedTaskGroup (290 ms)
[ RUN      ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag
[       OK ] SlaveTest.MaxCompletedExecutorsPerFrameworkFlag (1034 ms)
[ RUN      ] SlaveTest.ShutdownV0ExecutorIfItReregistersWithoutReconnect
[       OK ] SlaveTest.ShutdownV0ExecutorIfItReregistersWithoutReconnect (261 ms)
[ RUN      ] SlaveTest.IgnoreV0ExecutorIfItReregistersWithoutReconnect
[       OK ] SlaveTest.IgnoreV0ExecutorIfItReregistersWithoutReconnect (266 ms)
[ RUN      ] SlaveTest.BrowseExecutorSandboxByVirtualPath
[       OK ] SlaveTest.BrowseExecutorSandboxByVirtualPath (307 ms)
[ RUN      ] SlaveTest.DisconnectedExecutorDropsMessages
[       OK ] SlaveTest.DisconnectedExecutorDropsMessages (280 ms)
[ RUN      ] SlaveTest.ResourceProviderSubscribe
[       OK ] SlaveTest.ResourceProviderSubscribe (200 ms)
[ RUN      ] SlaveTest.ResourceVersions
[       OK ] SlaveTest.ResourceVersions (167 ms)
[ RUN      ] SlaveTest.ReconfigurationPolicy
[       OK ] SlaveTest.ReconfigurationPolicy (252 ms)
[ RUN      ] SlaveTest.ResourceProviderReconciliation
```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64379/logs/mesos-tests-stderr.log):

```
    @   00007FF785429FB0  google::LogMessage::SendToLog
    @   00007FF785429797  google::LogMessage::Flush
    @   00007FF78542B2D1  google::LogMessageFatal::~LogMessageFatal
    @   00007FF7833429AC  mesos::internal::slave::Slave::handleResourceProviderMessage
    @   00007FF7834EDB05   ?? 
    @   00007FF7833DB318  std::_Invoker_functor::_Call<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,process::ProcessBase * __ptr64>
    @   00007FF783461008  std::invoke<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,process::ProcessBase * __ptr64>
    @   00007FF78347120B  lambda::internal::Partial<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,std::_Ph<1> >::invoke_expand<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,std::tuple<process::Future<mesos::internal::ResourceProvi
    @   00007FF7833B756A  )<process::ProcessBase * __ptr64
    @   00007FF7833E196C  std::_Invoker_functor::_Call<lambda::internal::Partial<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,std::_Ph<1> >,process::ProcessBase * __ptr64>
    @   00007FF78346746C  std::invoke<lambda::internal::Partial<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,std::_Ph<1> >,process::ProcessBase * __ptr64>
    @   00007FF7833BF4D1  )<lambda::internal::Partial<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,std::_Ph<1> >,process::ProcessBase * __ptr64
    @   00007FF7834FBA26  process::ProcessBase * __ptr64)>::CallableFn<lambda::internal::Partial<<lambda_2c5dd0fb32c5d47ebd14bcab173f55d7>,process::Future<mesos::internal::ResourceProviderMessage>,std::_Ph<1> > >::operator(
    @   00007FF784F199ED  process::ProcessBase * __ptr64)>::operator(
    @   00007FF784DF2AC9  process::ProcessBase::consume
    @   00007FF784F6DB4A  process::DispatchEvent::consume
    @   00007FF7813B43B7  process::ProcessBase::serve
    @   00007FF784E007AB  process::ProcessManager::resume
    @   00007FF784F0A211   ?? 
    @   00007FF784E48CF0  std::_Invoker_functor::_Call<<lambda_124422ac022fa041208b80c1460630d7> >
    @   00007FF784E9E690  std::invoke<<lambda_124422ac022fa041208b80c1460630d7> >
    @   00007FF784E57AAC  std::_LaunchPad<std::unique_ptr<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> >,std::default_delete<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> > > > >::_Execute<0>
    @   00007FF784F55B4A  std::_LaunchPad<std::unique_ptr<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> >,std::default_delete<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> > > > >::_Run
    @   00007FF784F425D8  std::_LaunchPad<std::unique_ptr<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> >,std::default_delete<std::tuple<<lambda_124422ac022fa041208b80c1460630d7> > > > >::_Go
    @   00007FF784F2A2BD  std::_Pad::_Call_func
    @   00007FF7860CC078  invoke_thread_procedure
    @   00007FF7860CBB21  __cdecl*)(void * __ptr64)
    @   00007FFB34601FE4  BaseThreadInitThunk
    @   00007FFB3532EF91  RtlUserThreadStart
```

- Mesos Reviewbot Windows


On Dec. 13, 2017, 7:26 p.m., Armand Grillet wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64379/
> -----------------------------------------------------------
> 
> (Updated Dec. 13, 2017, 7:26 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov and Benno Evers.
> 
> 
> Bugs: MESOS-8328
>     https://issues.apache.org/jira/browse/MESOS-8328
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Add some steps to clean the Docker daemon
> state used by the Docker containerizer.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp d997b4272578efffed05d38771f17df387ccac48 
> 
> 
> Diff: https://reviews.apache.org/r/64379/diff/4/
> 
> 
> Testing
> -------
> 
> New logs:
> ```
> E1213 10:58:10.826020 10057 slave.cpp:6738] EXIT with status 1: Failed to perform recovery: <error>
> If recovery failed due to a change in configuration and you want to
> keep the current agent id, you might want to change the
> `--reconfiguration_policy` flag to a more permissive value.
> 
> To restart this agent with a new agent id instead, do as follows:
> rm -f /tmp/agent/meta/slaves/latest
> This ensures that the agent does not recover old live executors.
> 
> If you use the Docker containerizer and think that the Docker
> daemon state is broken, you can try to clear it. But be careful:
> these commands will erase all containers and images from this host,
> not just those started by Mesos!
> docker kill $(docker ps -q)
> docker rm $(docker ps -a -q)
> docker rmi $(docker images -q)
> 
> Finally, restart the agent.
> ```
> 
> 
> Thanks,
> 
> Armand Grillet
> 
>

Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Alexander Rukletsov <ru...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/#review194155
-----------------------------------------------------------


Ship it!

- Alexander Rukletsov



Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Benno Evers <be...@mesosphere.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/#review193903
-----------------------------------------------------------


Ship it!

- Benno Evers



Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Armand Grillet <ag...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/
-----------------------------------------------------------

(Updated Dec. 13, 2017, 7:26 p.m.)

Review request for mesos, Alexander Rukletsov and Benno Evers.

Changes
-------

Improved message.

Bugs: MESOS-8328
https://issues.apache.org/jira/browse/MESOS-8328

Repository: mesos

Description
-------

Add some steps to clean the Docker daemon
state used by the Docker containerizer.

Diffs (updated)
-----

src/slave/slave.cpp d997b4272578efffed05d38771f17df387ccac48

Diff: https://reviews.apache.org/r/64379/diff/4/

Changes: https://reviews.apache.org/r/64379/diff/3-4/

Testing (updated)
-------

New logs:
```
E1213 10:58:10.826020 10057 slave.cpp:6738] EXIT with status 1: Failed to perform recovery: <error>
If recovery failed due to a change in configuration and you want to
keep the current agent id, you might want to change the
`--reconfiguration_policy` flag to a more permissive value.

To restart this agent with a new agent id instead, do as follows:
rm -f /tmp/agent/meta/slaves/latest
This ensures that the agent does not recover old live executors.

If you use the Docker containerizer and think that the Docker
daemon state is broken, you can try to clear it. But be careful:
these commands will erase all containers and images from this host,
not just those started by Mesos!
docker kill $(docker ps -q)
docker rm $(docker ps -a -q)
docker rmi $(docker images -q)

Finally, restart the agent.
```
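The old logs pointed at `/var/lib/mesos/slave/meta/slaves/latest` while the new ones point at `/tmp/agent/meta/slaves/latest`, so the path in the help text depends on the agent's `--work_dir`. The C++ sketch below assembles the help text from the new logs with the work directory as its only parameter. `recoveryFailureHelp` is a hypothetical name and this is not the code from the patch (see the diff link above for that); it only illustrates the message layout under that assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the actual code in src/slave/slave.cpp: builds
// the recovery-failure help text shown in the new logs for a given agent
// work directory (e.g. "/tmp/agent").
std::string recoveryFailureHelp(const std::string& workDir)
{
  assert(!workDir.empty());

  // Both the old and the new logs place the `latest` symlink under
  // <work_dir>/meta/slaves/.
  const std::string latest = workDir + "/meta/slaves/latest";

  // No line starts with a tab, so nothing depends on how the log
  // collector renders leading whitespace.
  return
      "If recovery failed due to a change in configuration and you want to\n"
      "keep the current agent id, you might want to change the\n"
      "`--reconfiguration_policy` flag to a more permissive value.\n"
      "\n"
      "To restart this agent with a new agent id instead, do as follows:\n"
      "rm -f " + latest + "\n"
      "This ensures that the agent does not recover old live executors.\n"
      "\n"
      "If you use the Docker containerizer and think that the Docker\n"
      "daemon state is broken, you can try to clear it. But be careful:\n"
      "these commands will erase all containers and images from this host,\n"
      "not just those started by Mesos!\n"
      "docker kill $(docker ps -q)\n"
      "docker rm $(docker ps -a -q)\n"
      "docker rmi $(docker images -q)\n"
      "\n"
      "Finally, restart the agent.\n";
}
```

Calling `recoveryFailureHelp("/tmp/agent")` reproduces the text after the `EXIT with status 1` line in the new logs; passing the work directory in keeps the `rm -f` step pointing at the directory the agent was actually started with.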

Thanks,

Armand Grillet

Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Armand Grillet <ag...@mesosphere.io>.

(Updated Dec. 13, 2017, 4:04 p.m.)

Review request for mesos, Alexander Rukletsov and Benno Evers.

Changes
-------

Fixed issue.

Bugs: MESOS-8328
https://issues.apache.org/jira/browse/MESOS-8328

Repository: mesos

Description
-------

Add some steps to clean the Docker daemon
state used by the Docker containerizer.

Diffs (updated)
-----

src/slave/slave.cpp d997b4272578efffed05d38771f17df387ccac48

Diff: https://reviews.apache.org/r/64379/diff/3/

Changes: https://reviews.apache.org/r/64379/diff/2-3/

Testing (updated)
-------

```
To restart this agent with a new agent id instead, do as follows:
rm -f /tmp/agent/meta/slaves/latest
This ensures that the agent does not recover old live executors.

If you were using the docker containerizer, you might want to clear
the docker daemon state. These commands will erase all containers
and images from this host, not just those started by Mesos!
docker kill $(docker ps -q)
docker rm $(docker ps -a -q)
docker rmi $(docker images -q)

Finally, restart the agent.
```

Thanks,

Armand Grillet

Re: Review Request 64379: Improved logs displayed after a slave failed recovery.

Posted by Armand Grillet <ag...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64379/
-----------------------------------------------------------

(Updated Dec. 8, 2017, 1:23 p.m.)


Review request for mesos, Alexander Rukletsov and Benno Evers.


Changes
-------

Updated message.


Repository: mesos


Description
-------

Add some steps to clean the Docker daemon
state used by the Docker containerizer.


Diffs (updated)
-----

  src/slave/slave.cpp 54d8bcc035227dd6896ffa6e692a91749c0b56a6 


Diff: https://reviews.apache.org/r/64379/diff/2/

Changes: https://reviews.apache.org/r/64379/diff/1-2/


Testing
-------

None. The previous logs were:
```
Nov: To remedy this do as follows:
Nov: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
Nov: This ensures agent doesn't recover old live executors.
Nov: Step 2: Restart the agent.
```
I have therefore removed the tab before `This ensures agent doesn't recover`, since it did not appear in the logs.


Thanks,

Armand Grillet