Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2018/01/18 20:41:00 UTC
[jira] [Commented] (MESOS-8460) `Slave::detachFile` can segfault because it could use invalid Framework*

    [ https://issues.apache.org/jira/browse/MESOS-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331176#comment-16331176 ] 

Vinod Kone commented on MESOS-8460:
-----------------------------------

Debugged the issue with [~mcypark].

The problem is that the `=` capture in this piece of code implicitly captures `this`:

{code}
    slave->garbageCollect(path)
      .onAny(defer(slave->self(), [=](const Future<Nothing>& future) {
        slave->detachFile(path);

        if (executor->info.has_type() &&
            executor->info.type() == ExecutorInfo::DEFAULT) {
          foreachvalue (const Task* task, executor->launchedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }

          foreachvalue (const Task* task, executor->terminatedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }

          foreach (const shared_ptr<Task>& task, executor->completedTasks) {
            executor->detachTaskVolumeDirectory(*task);
          }
        }
      }));
{code}

 

Specifically, the `slave` pointer inside the onAny lambda actually refers to `this->slave`, which is a member variable of `Framework`. Because the `Framework` struct can be deleted before the onAny callback is executed, `this` may be dangling by then, and the `slave` pointer read through it can be invalid.

The proposed fix here is to explicitly capture member variables of `Framework` instead of using `=` in the lambda.

Note that there is more than one place in the code where we have to fix this.
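
To illustrate the explicit-capture idea, here is a minimal standalone sketch. It does not use the Mesos or libprocess APIs: `Slave` and `Framework` below are simplified stand-ins, and `makeDetachCallback` and `runAfterFrameworkDestroyed` are hypothetical names introduced only for this example. It shows a callback that copies the member pointer into a local before capturing it, so the callback stays valid after the `Framework` is gone. It assumes, as in the scenario above, that the `Slave` itself outlives the callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos types; only the shape matters here.
struct Slave {
  std::vector<std::string> detached;
  void detachFile(const std::string& path) { detached.push_back(path); }
};

struct Framework {
  Slave* slave;

  explicit Framework(Slave* s) : slave(s) {}

  // The problematic form would be:
  //   return [=]() { slave->detachFile(path); };
  // `[=]` captures `this`, so `slave` is read as `this->slave` when the
  // callback runs, which is invalid once the Framework has been destroyed.

  // Explicit capture: copy the member into a local first, then capture the
  // local by value so the callback never touches `this`.
  std::function<void()> makeDetachCallback(const std::string& path) {
    Slave* target = slave;
    return [target, path]() {
      assert(target != nullptr);
      target->detachFile(path);
    };
  }
};

// Creates the callback, destroys the Framework, and only then runs the
// callback, mimicking a deferred onAny continuation.
std::vector<std::string> runAfterFrameworkDestroyed(const std::string& path) {
  Slave slave;
  std::function<void()> callback;
  {
    Framework framework(&slave);
    callback = framework.makeDetachCallback(path);
  }  // `framework` is destroyed here; the callback holds its own Slave*.
  callback();
  return slave.detached;
}
```

Calling `runAfterFrameworkDestroyed("/some/path")` runs the callback after the `Framework` has gone out of scope and still records the path on the `Slave`. The same pattern would apply to any other member used inside the lambda, such as the `executor` pointer, provided the pointed-to object is itself still alive when the callback runs.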

 

> `Slave::detachFile` can segfault because it could use invalid Framework*
> ------------------------------------------------------------------------
>
>                 Key: MESOS-8460
>                 URL: https://issues.apache.org/jira/browse/MESOS-8460
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>            Priority: Major
>
> Observed this SEGV in an internal cluster
> {code}
> {noformat}
> 2018-01-18 19:00:54: *** SIGSEGV (@0x0) received by PID 26410 (TID 0x7fe9e4f65700) from PID 0; stack trace: ***
> 2018-01-18 19:00:54: @     0x7fe9ea2c85e0 (unknown)
> 2018-01-18 19:00:54: @     0x7fe9ec4cc855 mesos::internal::Files::detach()
> 2018-01-18 19:00:54: @     0x7fe9ec8cb5b0 mesos::internal::slave::Slave::detachFile()
> 2018-01-18 19:00:54: @     0x7fe9ec8ccadb _ZZN5mesos8internal5slave9Framework15recoverExecutorERKNS1_5state13ExecutorStateEbRK7hashsetINS_6TaskIDESt4hashIS8_ESt8equal_toIS8_EEENKUlRKN7process6FutureI7NothingEEE0_clESL_.isra.2000
> 2018-01-18 19:00:54: @     0x7fe9ec37e4e4 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEEEEEclEOS3_
> 2018-01-18 19:00:54: @     0x7fe9ed455ea1 process::ProcessBase::consume()
> 2018-01-18 19:00:54: @     0x7fe9ed464bcc process::ProcessManager::resume()
> 2018-01-18 19:00:54: @     0x7fe9ed46a136 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 2018-01-18 19:00:54: @     0x7fe9ea7a0230 (unknown)
> 2018-01-18 19:00:54: @     0x7fe9ea2c0e25 start_thread
> 2018-01-18 19:00:54: @     0x7fe9e9fee34d __clone
> 2018-01-18 19:00:54: dcos-mesos-slave.service: main process exited, code=killed, status=11/SEGV{noformat}
> {code}


