Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2017/04/12 18:19:41 UTC

[jira] [Updated] (MESOS-7008) Quota not recovered from registry in empty cluster.

     [ https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-7008:
------------------------------
    Target Version/s: 1.2.0, 1.1.1, 1.0.4  (was: 1.0.3, 1.1.1, 1.2.0)
       Fix Version/s:     (was: 1.0.3)
                      1.0.4

> Quota not recovered from registry in empty cluster.
> ---------------------------------------------------
>
>                 Key: MESOS-7008
>                 URL: https://issues.apache.org/jira/browse/MESOS-7008
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.0.2, 1.1.0
>         Environment: OS X, SSL build
>            Reporter: Benjamin Bannier
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: quota, roles
>             Fix For: 1.1.1, 1.2.0, 1.0.4
>
>
> When a quota has been set and the master is restarted, removing the quota reliably triggers a {{CHECK}} failure.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
>     "role": "role2",
>     "force": true,
>     "guarantee": [
>         {
>             "name": "cpus",
>             "type": "SCALAR",
>             "scalar": { "value": 1 }
>         }
>     ]
> }
> $ cat quota.json| http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
>     "infos": [
>         {
>             "guarantee": [
>                 {
>                     "name": "cpus",
>                     "role": "*",
>                     "scalar": {
>                         "value": 1.0
>                     },
>                     "type": "SCALAR"
>                 }
>             ],
>             "role": "role2"
>         }
>     ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
>     "roles": [
>         {
>             "frameworks": [],
>             "name": "role2",
>             "resources": {
>                 "cpus": 0,
>                 "disk": 0,
>                 "gpus": 0,
>                 "mem": 0
>             },
>             "weight": 1.0
>         }
>     ]
> }
> {code}
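For reference, the request body above can be sanity-checked programmatically. A minimal Python sketch (illustrative only; the field names and values come directly from the quota request shown above):

```python
import json

# Quota request body as posted to the master's /quota endpoint above.
quota_request = """
{
    "role": "role2",
    "force": true,
    "guarantee": [
        {
            "name": "cpus",
            "type": "SCALAR",
            "scalar": { "value": 1 }
        }
    ]
}
"""

quota = json.loads(quota_request)

# Setting a quota for "role2" implicitly creates that role on the master.
assert quota["role"] == "role2"

# Each guarantee entry is a Resource-style object: name, type, scalar value.
for resource in quota["guarantee"]:
    assert resource["type"] == "SCALAR"
    assert resource["scalar"]["value"] > 0
```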
> Restart the master process with the same {{work_dir}}, then attempt to delete the quota. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and aborts.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated frameworks to register
> I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated agents to register
> I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks to register without authentication
> I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5' authenticator
> W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided, authentication requests will be refused
> I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL
> I0126 13:34:57.825337 255578112 master.cpp:2121] Elected as the leading master!
> I0126 13:34:57.825362 255578112 master.cpp:1643] Recovering from registrar
> I0126 13:34:57.825736 255578112 log.cpp:553] Attempting to start the writer
> I0126 13:34:57.826889 258260992 replica.cpp:495] Replica received implicit promise request from __req_res__(1)@172.18.9.56:5050 with proposal 2
> I0126 13:34:57.828855 258260992 replica.cpp:344] Persisted promised to 2
> I0126 13:34:57.829273 258260992 coordinator.cpp:238] Coordinator attempting to fill missing positions
> I0126 13:34:57.829375 259334144 log.cpp:569] Writer started with ending position 4
> I0126 13:34:57.830878 257187840 registrar.cpp:362] Successfully fetched the registry (159B) in 5.427968ms
> I0126 13:34:57.831029 257187840 registrar.cpp:461] Applied 1 operations in 24us; attempting to update the registry
> I0126 13:34:57.836194 259334144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 5
> I0126 13:34:57.836676 257724416 replica.cpp:539] Replica received write request for position 5 from __req_res__(2)@172.18.9.56:5050
> I0126 13:34:57.837102 255578112 replica.cpp:693] Replica received learned notice for position 5 from @0.0.0.0:0
> I0126 13:34:57.837745 257187840 registrar.cpp:506] Successfully updated the registry in 6.685184ms
> I0126 13:34:57.837806 257187840 registrar.cpp:392] Successfully recovered registrar
> I0126 13:34:57.837924 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 6
> I0126 13:34:57.838132 256651264 master.cpp:1759] Recovered 0 agents from the registry (159B); allowing 10mins for agents to re-register
> I0126 13:34:57.838312 257187840 replica.cpp:539] Replica received write request for position 6 from __req_res__(3)@172.18.9.56:5050
> I0126 13:34:57.838692 256651264 replica.cpp:693] Replica received learned notice for position 6 from @0.0.0.0:0
> I0126 13:36:04.887257 256114688 http.cpp:420] HTTP DELETE for /master/quota/role2 from 127.0.0.1:51458 with User-Agent='HTTPie/0.9.8'
> I0126 13:36:04.887512 255578112 registrar.cpp:461] Applied 1 operations in 42us; attempting to update the registry
> I0126 13:36:04.892643 255578112 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 7
> I0126 13:36:04.893127 258797568 replica.cpp:539] Replica received write request for position 7 from __req_res__(4)@172.18.9.56:5050
> I0126 13:36:04.895309 257187840 replica.cpp:693] Replica received learned notice for position 7 from @0.0.0.0:0
> I0126 13:36:04.895814 258260992 registrar.cpp:506] Successfully updated the registry in 8.2688ms
> F0126 13:36:04.895956 256114688 hierarchical.cpp:1180] Check failed: quotas.contains(role)
> *** Check failure stack trace: ***
> I0126 13:36:04.895961 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 8
> I0126 13:36:04.896437 257187840 replica.cpp:539] Replica received write request for position 8 from __req_res__(5)@172.18.9.56:5050
> I0126 13:36:04.896908 259334144 replica.cpp:693] Replica received learned notice for position 8 from @0.0.0.0:0
>     @        0x10b5e52aa  google::LogMessage::Fail()
> E0126 13:36:04.905042 259870720 process.cpp:2419] Failed to shutdown socket with fd 11: Socket is not connected
>     @        0x10b5e282c  google::LogMessage::SendToLog()
>     @        0x10b5e3959  google::LogMessage::Flush()
>     @        0x10b5ee159  google::LogMessageFatal::~LogMessageFatal()
>     @        0x10b5e5795  google::LogMessageFatal::~LogMessageFatal()
>     @        0x1089e8d17  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeQuota()
>     @        0x107ebbc13  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSt3__112basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEESC_EEvRKNS_3PIDIT_EEMSG_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESP_
>     @        0x107ebbab0  _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EEvRKNS3_3PIDIT_EEMSJ_FvT0_ET1_EUlPNS3_11ProcessBaseEE_SS_EEEvDpOT_
>     @        0x107ebb7b9  _ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EEvRKNS2_3PIDIT_EEMSI_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NSC_ISS_EEFvSR_EEclEOSR_
>     @        0x10b38ba27  std::__1::function<>::operator()()
>     @        0x10b38b96c  process::ProcessBase::visit()
>     @        0x10b40415e  process::DispatchEvent::visit()
>     @        0x107665171  process::ProcessBase::serve()
>     @        0x10b385c07  process::ProcessManager::resume()
>     @        0x10b47db90  process::ProcessManager::init_threads()::$_0::operator()()
>     @        0x10b47d7e0  _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_0EEEEEPvSB_
>     @     0x7fffb2b8eaab  _pthread_body
>     @     0x7fffb2b8e9f7  _pthread_start
>     @     0x7fffb2b8e1fd  thread_start
> [2]    59343 abort      mesos-master --work_dir=work_dir
> {code}
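The failing check ({{Check failed: quotas.contains(role)}} in {{HierarchicalAllocatorProcess::removeQuota()}}) together with the issue title suggests that the quota persisted in the registry is never replayed into the allocator when the cluster is empty. A toy Python model of that invariant violation (purely illustrative; the class and method names are hypothetical stand-ins, not Mesos code):

```python
class ToyAllocator:
    """Illustrative stand-in for the hierarchical allocator's quota bookkeeping."""

    def __init__(self):
        self.quotas = {}  # role -> guarantee

    def set_quota(self, role, guarantee):
        self.quotas[role] = guarantee

    def recover(self, registry_quotas, agent_count):
        # Hypothetical buggy recovery path: quotas only reach the allocator
        # while agents re-register, so an empty cluster skips them entirely.
        if agent_count > 0:
            for role, guarantee in registry_quotas.items():
                self.quotas[role] = guarantee

    def remove_quota(self, role):
        # Mirrors the CHECK at hierarchical.cpp:1180: the role is expected
        # to already have a quota in the allocator's map.
        assert role in self.quotas, "Check failed: quotas.contains(role)"
        del self.quotas[role]


# The registry persisted the quota across restart, but no agents registered.
allocator = ToyAllocator()
allocator.recover({"role2": {"cpus": 1.0}}, agent_count=0)

try:
    allocator.remove_quota("role2")
except AssertionError as e:
    print(e)  # the toy equivalent of the fatal CHECK above
```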



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)