Posted to reviews@mesos.apache.org by GitBox <gi...@apache.org> on 2022/04/18 19:17:07 UTC

[GitHub] [mesos] cf-natali opened a new pull request, #426: Fixed a crash in StorageLocalResourceProviderProcess.

cf-natali opened a new pull request, #426:
URL: https://github.com/apache/mesos/pull/426

   `StorageLocalResourceProviderProcess::connected` checks that the
   current state is `DISCONNECTED` and crashes if the state is `READY`
   instead, which can happen if the periodic reconciliation runs after
   disconnection.
   
   It can be reproduced by running
   `ContentType/AgentResourceProviderConfigApiTest.Add/0` in a loop,
   preferably with some CPU-intensive workload in the background to affect
   the timing.
   
   Update the check to allow `READY` as well.
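
The race described above can be modelled in a few lines. This is a minimal sketch, not the Mesos implementation: the `State` enum ordering is an assumption chosen so that `DISCONNECTED` is 1 and `READY` is 4, matching the `(1 vs. 4)` in the failure log, and `ProviderModel`, `oldPrecondition`, `newPrecondition`, and `replayLogOrder` are hypothetical names introduced for illustration.

```cpp
#include <cassert>  // Only needed for the usage lines shown after the block.

// Hypothetical stand-in for the provider's state enum. The ordering is an
// assumption: it makes DISCONNECTED == 1 and READY == 4, as in "(1 vs. 4)".
enum State { RECOVERING, DISCONNECTED, CONNECTED, SUBSCRIBED, READY };

// Precondition before the fix: only DISCONNECTED is accepted.
bool oldPrecondition(State state) { return state == DISCONNECTED; }

// Precondition after the fix: READY is accepted as well.
bool newPrecondition(State state)
{
  return state == DISCONNECTED || state == READY;
}

// Toy model of the provider, reduced to its state transitions.
struct ProviderModel
{
  State state = READY;

  void disconnected() { state = DISCONNECTED; }

  // Assumed effect of the periodic reconciliation finishing.
  void reconciled() { state = READY; }

  // Returns false where the real code would hit a fatal check failure.
  // Moving to CONNECTED on success is an assumed transition.
  bool connected(bool relaxed)
  {
    const bool ok = relaxed ? newPrecondition(state) : oldPrecondition(state);
    if (ok) {
      state = CONNECTED;
    }
    return ok;
  }
};

// Replays the order seen in the log: disconnect, reconcile, then connect.
bool replayLogOrder(bool relaxed)
{
  ProviderModel provider;
  provider.disconnected();
  provider.reconciled();
  return provider.connected(relaxed);
}
```

With this model, `assert(!replayLogOrder(false))` and `assert(replayLogOrder(true))` both hold: the old check rejects the `READY` state left behind by reconciliation, while the relaxed check accepts it.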
   
   ```
   3: I0408 09:31:11.591161 19179 http_connection.hpp:338] Ignoring disconnection attempt from stale connection
   3: I0408 09:31:11.591224 19179 http_connection.hpp:338] Ignoring disconnection attempt from stale connection
   3: I0408 09:31:11.591305 19179 http_connection.hpp:227] New endpoint detected at http://172.17.0.3:45793/slave(1162)/api/v1/resource_provider
   3: I0408 09:31:11.593901 19174 http_connection.hpp:283] Connected with the remote endpoint at http://172.17.0.3:45793/slave(1162)/api/v1/resource_provider
   3: I0408 09:31:11.593940 19190 provider.cpp:488] Disconnected from resource provider manager
   3: I0408 09:31:11.594046 19190 provider.cpp:749] Resource provider 5a147f6c-6be9-4c43-9a88-31528644efb9 is in READY state
   3: I0408 09:31:11.594060 19189 status_update_manager_process.hpp:379] Pausing operation status update manager
   3: I0408 09:31:11.594211 19189 status_update_manager_process.hpp:385] Resuming operation status update manager
   3: F0408 09:31:11.594637 19190 provider.cpp:474] Check failed: DISCONNECTED == state (1 vs. 4) 
   3: *** Check failure stack trace: ***
   3: I0408 09:31:11.636463 19191 hierarchical.cpp:1953] Performed allocation for 1 agents in 208808ns
   3:     @     0x7fb58f06191d  google::LogMessage::Fail()
   3:     @     0x7fb58f060ca7  google::LogMessage::SendToLog()
   3:     @     0x7fb58f0615e2  google::LogMessage::Flush()
   3:     @     0x7fb58f0650a8  google::LogMessageFatal::~LogMessageFatal()
   3: I0408 09:31:11.727972 19184 containerizer.cpp:3252] Container org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE has exited
   3: I0408 09:31:11.729707 19186 provisioner.cpp:652] Ignoring destroy request for unknown container org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE
   3: I0408 09:31:11.732404 19188 container_daemon.cpp:189] Invoking post-stop hook for container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE'
   3: I0408 09:31:11.732600 19177 service_manager.cpp:815] Disconnected from endpoint 'unix:///tmp/mesos-csi-NLwX0Z/endpoint.sock' of CSI plugin container org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE
   3: I0408 09:31:11.732837 19176 container_daemon.cpp:121] Launching container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE'
   3: I0408 09:31:11.735456 19197 process.cpp:2781] Returning '404 Not Found' for '/slave(1162)/api/v1'
   3: E0408 09:31:11.736846 19194 container_daemon.cpp:150] Failed to launch container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE': Failed to launch container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE': Unexpected response '404 Not Found' (404 Not Found.)
   3: E0408 09:31:11.737042 19186 service_manager.cpp:843] Container daemon for 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE' failed: Failed to launch container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-local_17120eece4184cbe8473563ff2fdafa6--CONTROLLER_SERVICE-NODE_SERVICE': Unexpected response '404 Not Found' (404 Not Found.)
   3:     @     0x7fb599d7c896  mesos::internal::StorageLocalResourceProviderProcess::connected()
   3:     @     0x7fb599df16de  _ZZN7process8dispatchIN5mesos8internal35StorageLocalResourceProviderProcessEEEvRKNS_3PIDIT_EEMS5_FvvEENKUlPNS_11ProcessBaseEE_clESC_
   3:     @     0x7fb599df15a2  _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal35StorageLocalResourceProviderProcessEEEvRKNS1_3PIDIT_EEMS7_FvvEEUlPNS1_11ProcessBaseEE_JSE_EEEDTclclsr3stdE7forwardIS7_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOS7_DpOSG_
   3:     @     0x7fb599df1566  _ZN6lambda8internal6InvokeIvEclIZN7process8dispatchIN5mesos8internal35StorageLocalResourceProviderProcessEEEvRKNS4_3PIDIT_EEMSA_FvvEEUlPNS4_11ProcessBaseEE_JSH_EEEvOSA_DpOT0_
   3:     @     0x7fb599df150a  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnIZNS1_8dispatchIN5mesos8internal35StorageLocalResourceProviderProcessEEEvRKNS1_3PIDIT_EEMSC_FvvEEUlS3_E_EclEOS3_
   3:     @     0x7fb590239b3b  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
   3:     @     0x7fb5901fb119  process::ProcessBase::consume()
   3:     @     0x7fb5902997f9  _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
   3:     @          0x120b9e4  process::ProcessBase::serve()
   3:     @     0x7fb5901f7c5f  process::ProcessManager::resume()
   3:     @     0x7fb59021fcdb  process::ProcessManager::init_threads()::$_15::operator()()
   3:     @     0x7fb59021fb85  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE4$_15vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
   3:     @     0x7fb59021fb55  std::_Bind_simple<>::operator()()
   3:     @     0x7fb59021fa49  std::thread::_Impl<>::_M_run()
   3:     @     0x7fb58965fc80  (unknown)
   3:     @     0x7fb58ebc86ba  start_thread
   3:     @     0x7fb588dc541d  clone
   3:     @              (nil)  (unknown)
   ```
   
   Originally seen in Jenkins: https://builds.apache.org/job/Mesos/job/Mesos-Buildbot/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1%20MESOS_TEST_AWAIT_TIMEOUT=60secs,OS=ubuntu%3A16.04,label_exp=ubuntu/140/console


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [mesos] qianzhangxa merged pull request #426: Fixed a crash in StorageLocalResourceProviderProcess.

Posted by GitBox <gi...@apache.org>.
qianzhangxa merged PR #426:
URL: https://github.com/apache/mesos/pull/426




[GitHub] [mesos] qianzhangxa commented on a diff in pull request #426: Fixed a crash in StorageLocalResourceProviderProcess.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on code in PR #426:
URL: https://github.com/apache/mesos/pull/426#discussion_r853115971


##########
src/resource_provider/storage/provider.cpp:
##########
@@ -471,7 +471,7 @@ StorageLocalResourceProviderProcess::StorageLocalResourceProviderProcess(
 
 void StorageLocalResourceProviderProcess::connected()
 {
-  CHECK_EQ(DISCONNECTED, state);
+  CHECK(state == DISCONNECTED || state == READY) << state;

Review Comment:
   I suggest changing it to:
   
   ```
   CHECK(state == DISCONNECTED || state == READY) << "Unexpected state: " << state;
   ```
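
The effect of the suggested prefix on the message can be sketched without glog. This is an illustration under assumptions: the `State` enum below is a hypothetical stand-in, and `bareMessage` and `labelledMessage` are made-up helpers that only mimic the text streamed after the check; they are not glog or Mesos APIs. An unscoped enum streams as its integer value, which is consistent with the `(1 vs. 4)` seen in the crash log.

```cpp
#include <cassert>  // Only needed for checking the examples after the block.
#include <sstream>
#include <string>

// Hypothetical stand-in for the provider's state enum (ordering assumed).
enum State { RECOVERING, DISCONNECTED, CONNECTED, SUBSCRIBED, READY };

// Mimics `<< state`: the failure message carries only a bare integer.
std::string bareMessage(State state)
{
  std::ostringstream out;
  out << state;
  return out.str();
}

// Mimics `<< "Unexpected state: " << state`: the integer gets a label.
std::string labelledMessage(State state)
{
  std::ostringstream out;
  out << "Unexpected state: " << state;
  return out.str();
}
```

For example, `bareMessage(CONNECTED)` yields `2`, while `labelledMessage(CONNECTED)` yields `Unexpected state: 2`, which is easier to recognise in a log.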





[GitHub] [mesos] cf-natali commented on a diff in pull request #426: Fixed a crash in StorageLocalResourceProviderProcess.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on code in PR #426:
URL: https://github.com/apache/mesos/pull/426#discussion_r853343366


##########
src/resource_provider/storage/provider.cpp:
##########
@@ -471,7 +471,7 @@ StorageLocalResourceProviderProcess::StorageLocalResourceProviderProcess(
 
 void StorageLocalResourceProviderProcess::connected()
 {
-  CHECK_EQ(DISCONNECTED, state);
+  CHECK(state == DISCONNECTED || state == READY) << state;

Review Comment:
   Sure!



##########
src/resource_provider/storage/provider.cpp:
##########
@@ -471,7 +471,7 @@ StorageLocalResourceProviderProcess::StorageLocalResourceProviderProcess(
 
 void StorageLocalResourceProviderProcess::connected()
 {
-  CHECK_EQ(DISCONNECTED, state);
+  CHECK(state == DISCONNECTED || state == READY) << state;

Review Comment:
   Done.





[GitHub] [mesos] qianzhangxa commented on pull request #426: Fixed a crash in StorageLocalResourceProviderProcess.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on PR #426:
URL: https://github.com/apache/mesos/pull/426#issuecomment-1103385219

   @cf-natali Thanks for the contribution! I assume you have already verified the fix by running
   `ContentType/AgentResourceProviderConfigApiTest.Add/0` in a loop.

