Posted to issues@mesos.apache.org by "Benno Evers (JIRA)" <ji...@apache.org> on 2018/08/22 19:28:00 UTC

[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.

    [ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589281#comment-16589281 ] 

Benno Evers commented on MESOS-9177:
------------------------------------

As a preliminary update, I managed to narrow down the location of the segfault to this lambda inside the FullFrameworkWriter:

{code}
      foreach (const Owned<Task>& task, framework_->completedTasks) {
        // Skip unauthorized tasks.
        if (!approvers_->approved<VIEW_TASK>(*task, framework_->info)) {
          continue;
        }

        writer->element(*task);
      }
{code}

Since the Mesos cluster where this segfault was observed runs with a non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I tried reproducing the crash by starting a mesos-master built from the same commit locally, using the `no-executor-framework` to run many tasks, and repeatedly hitting the state endpoint on this master. While I was able to overload the JSON renderer of my web browser, I didn't manage to reproduce the crash.

Next, I turned to reverse engineering the exact location of the crash, which seems to happen while incrementing a `boost::circular_buffer::iterator` (i.e. an iterator into the container behind `Master::Framework::completedTasks`). This suggests that we are probably pushing values into this container while simultaneously iterating over it in another thread.

However, I still don't have a theory for how this could happen, or for how to induce the crash locally, since all mutations appear to happen on the Master actor and thus should not run in parallel with each other.
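
To make the suspected failure mode concrete, here is a minimal single-threaded sketch of what a push into a full ring buffer does to a slot that a reader is still looking at. Everything in it is hypothetical and for illustration only: `TinyRing` is a toy container and not `boost::circular_buffer`, `FakeTask` is a stand-in for the real `Task` message, and `idSeenByStaleReader` is an invented helper. It only assumes that a full circular buffer recycles its oldest slot on insertion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a completed task; not the Mesos `Task` message.
struct FakeTask {
  std::string id;
};

// Toy fixed-capacity ring buffer (illustration only; NOT boost::circular_buffer).
// Once full, push_back() overwrites the oldest slot.
template <typename T, std::size_t N>
class TinyRing {
public:
  void push_back(T value) {
    slots_[(head_ + size_) % N] = std::move(value);
    if (size_ < N) {
      ++size_;
    } else {
      head_ = (head_ + 1) % N;  // The oldest element was just overwritten.
    }
  }

  // Reference to the i-th oldest element.
  T& at(std::size_t i) { return slots_[(head_ + i) % N]; }

  std::size_t size() const { return size_; }

private:
  std::array<T, N> slots_{};
  std::size_t head_ = 0;
  std::size_t size_ = 0;
};

// Returns the task id visible through a reference that was taken
// before the buffer wrapped around.
std::string idSeenByStaleReader() {
  TinyRing<std::shared_ptr<FakeTask>, 2> completed;
  completed.push_back(std::make_shared<FakeTask>(FakeTask{"task-1"}));
  completed.push_back(std::make_shared<FakeTask>(FakeTask{"task-2"}));

  // "Reader": grabs the oldest slot, as an iteration over the buffer would.
  std::shared_ptr<FakeTask>& slot = completed.at(0);
  assert(slot->id == "task-1");

  // "Writer": another task completes; the full buffer recycles that slot.
  completed.push_back(std::make_shared<FakeTask>(FakeTask{"task-3"}));

  // Same slot, different task: the object the reader saw has been released.
  return slot->id;
}
```

In this sequential version the reader merely observes a different task in the slot it was holding. If the same two steps ran on different threads without synchronization, the reader could instead observe the slot in the middle of being reassigned, which would be a data race and is at least consistent with a crash while walking the container.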

> Mesos master segfaults when responding to /state requests.
> ----------------------------------------------------------
>
>                 Key: MESOS-9177
>                 URL: https://issues.apache.org/jira/browse/MESOS-9177
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Alexander Rukletsov
>            Assignee: Benno Evers
>            Priority: Blocker
>              Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; stack trace: ***
>  @     0x7f367e7226d0 (unknown)
>  @     0x7f3681266913 _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @     0x7f3681266af0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()()
>  @     0x7f36812889d0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f368121aef0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f3681241be3 _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApproversEEEE_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @     0x7f3681242760 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @     0x7f368215f60e process::http::OK::OK()
>  @     0x7f3681219061 _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApproversEEEE_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @     0x7f36812212c0 _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApproversEEEE_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @     0x7f36812215ac _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApproversEEEE_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEEEEEclEOS3_
>  @     0x7f36821f3541 process::ProcessBase::consume()
>  @     0x7f3682209fbc process::ProcessManager::resume()
>  @     0x7f368220fa76 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @     0x7f367eefc2b0 (unknown)
>  @     0x7f367e71ae25 start_thread
>  @     0x7f367e444bad __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)