Posted to issues@mesos.apache.org by "Timothy Chen (JIRA)" <ji...@apache.org> on 2014/11/22 00:32:33 UTC

[jira] [Closed] (MESOS-1922) Slave blocks on the fetcher after terminating an executor

     [ https://issues.apache.org/jira/browse/MESOS-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen closed MESOS-1922.
-------------------------------

> Slave blocks on the fetcher after terminating an executor
> ---------------------------------------------------------
>
>                 Key: MESOS-1922
>                 URL: https://issues.apache.org/jira/browse/MESOS-1922
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Tobi Knaup
>            Assignee: Timothy Chen
>
> When the slave terminates an executor because the registration timeout hits, it will hold on to the fetcher process if it is still running, and not send a TASK_FAILED until the fetcher exits. Expected behavior would be to terminate both the executor and the fetcher, and then send the status update immediately.
> Here are some logs:
> {code}
> I1014 11:36:56.761726 209186816 slave.cpp:1139] Launching task download.1370d754-53d1-11e4-9fc2-0a0027000000 for framework 20140927-211310-16777343-5050-44274-0001
> I1014 11:36:56.766891 205430784 containerizer.cpp:394] Starting container 'be0f9918-986a-4692-ba9f-8c07871c5226' for executor 'download.1370d754-53d1-11e4-9fc2-0a0027000000' of framework '20140927-211310-16777343-5050-44274-0001'
> I1014 11:36:56.766922 209186816 slave.cpp:1252] Queuing task 'download.1370d754-53d1-11e4-9fc2-0a0027000000' for executor download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework '20140927-211310-16777343-5050-44274-0001
> I1014 11:36:56.768117 205430784 launcher.cpp:137] Forked child with pid '13624' for container 'be0f9918-986a-4692-ba9f-8c07871c5226'
> I1014 11:36:56.768647 205430784 containerizer.cpp:510] Fetching URIs for container 'be0f9918-986a-4692-ba9f-8c07871c5226' using command '/usr/local/libexec/mesos/mesos-fetcher'
> I1014 11:37:43.375211 207577088 slave.cpp:3132] Current usage 92.50%. Max allowed age: 0ns
> I1014 11:37:56.768044 205967360 slave.cpp:3089] Terminating executor download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001 because it did not register within 1mins
> I1014 11:37:56.768321 206503936 containerizer.cpp:882] Destroying container 'be0f9918-986a-4692-ba9f-8c07871c5226'
> I1014 11:37:56.817491 207577088 containerizer.cpp:997] Executor for container 'be0f9918-986a-4692-ba9f-8c07871c5226' has exited
> {code}
> At this point there is still a running fetcher. After killing it manually I see:
> {code}
> W1014 11:49:06.310417 207040512 containerizer.cpp:872] Ignoring destroy of unknown container: be0f9918-986a-4692-ba9f-8c07871c5226
> E1014 11:49:06.310560 208650240 slave.cpp:2564] Container 'be0f9918-986a-4692-ba9f-8c07871c5226' for executor 'download.1370d754-53d1-11e4-9fc2-0a0027000000' of framework '20140927-211310-16777343-5050-44274-0001' failed to start: Failed to fetch URIs for container 'be0f9918-986a-4692-ba9f-8c07871c5226': exit status 15
> E1014 11:49:06.310597 208650240 slave.cpp:2659] Termination of executor 'download.1370d754-53d1-11e4-9fc2-0a0027000000' of framework '20140927-211310-16777343-5050-44274-0001' failed: Unknown container: be0f9918-986a-4692-ba9f-8c07871c5226
> E1014 11:49:06.310699 205430784 slave.cpp:2945] Failed to unmonitor container for executor download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001: Not monitored
> I1014 11:49:06.315104 208650240 slave.cpp:2115] Handling status update TASK_FAILED (UUID: c216eb10-cfdc-4a9e-a687-e260701daed4) for task download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001 from @0.0.0.0:0
> W1014 11:49:06.315213 208113664 containerizer.cpp:788] Ignoring update for unknown container: be0f9918-986a-4692-ba9f-8c07871c5226
> I1014 11:49:06.315398 205967360 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: c216eb10-cfdc-4a9e-a687-e260701daed4) for task download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001
> I1014 11:49:06.315489 205967360 status_update_manager.cpp:373] Forwarding status update TASK_FAILED (UUID: c216eb10-cfdc-4a9e-a687-e260701daed4) for task download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001 to master@127.0.0.1:5050
> I1014 11:49:06.328732 209186816 status_update_manager.cpp:398] Received status update acknowledgement (UUID: c216eb10-cfdc-4a9e-a687-e260701daed4) for task download.1370d754-53d1-11e4-9fc2-0a0027000000 of framework 20140927-211310-16777343-5050-44274-0001
> I1014 11:49:06.328951 207040512 slave.cpp:2811] Cleaning up executor 'download.1370d754-53d1-11e4-9fc2-0a0027000000' of framework 20140927-211310-16777343-5050-44274-0001
> {code}
> To reproduce, launch a task with a URI that takes longer than the executor registration timeout to download.
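
The expected behavior described in the report (on registration timeout, terminate the still-running fetcher rather than waiting for it, then report the failure immediately) can be illustrated with a minimal sketch. This is not the Mesos implementation or its fix: the function name, the SLOW_FETCH stand-in command, and the "FETCHED" status string are hypothetical, and the fetcher is modelled as a plain child process.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a fetch that outlasts the registration timeout.
SLOW_FETCH = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_registration_timeout(fetcher_cmd, timeout_secs):
    """Run a stand-in fetcher; on timeout, kill it and report failure at once.

    Returns a (status, elapsed_seconds) tuple. The names used here are
    illustrative assumptions and are not Mesos APIs.
    """
    start = time.monotonic()
    fetcher = subprocess.Popen(fetcher_cmd)
    try:
        fetcher.wait(timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        # The timeout hit while the fetch is still running: terminate the
        # fetcher instead of blocking until it exits on its own.
        fetcher.kill()
        fetcher.wait()  # reap the child so it does not linger as a zombie
        return "TASK_FAILED", time.monotonic() - start
    # The fetcher finished in time; map its exit code to a status.
    status = "FETCHED" if fetcher.returncode == 0 else "TASK_FAILED"
    return status, time.monotonic() - start
```

Calling run_with_registration_timeout(SLOW_FETCH, 1) should return a TASK_FAILED status after roughly the timeout, not after the full 30-second stand-in download. The logs above show the opposite: the status update arrived only once the fetcher was killed by hand.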



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)