Posted to issues@mesos.apache.org by "Chengwei Yang (JIRA)" <ji...@apache.org> on 2014/09/17 10:49:33 UTC
[jira] [Commented] (MESOS-1804) the "store" component cause on-top
framework (chronos) crash
[ https://issues.apache.org/jira/browse/MESOS-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136961#comment-14136961 ]
Chengwei Yang commented on MESOS-1804:
--------------------------------------
This looks like a serious problem with future->state. I ran more rounds of tests and hit different failures, shown below.
1. future->state is still PENDING when CHECK_READY() is called, which means the future had not reached the READY state by the time future->await() returned.
{code}
F0917 16:34:52.340755 8838 org_apache_mesos_state_AbstractState.cpp:330] CHECK_READY(*future): is PENDING
*** Check failure stack trace: ***
@ 0x7fd1412eac6d google::LogMessage::Fail()
@ 0x7fd1412eec87 google::LogMessage::SendToLog()
@ 0x7fd1412ecb09 google::LogMessage::Flush()
@ 0x7fd1412ece0d google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd140cfa416 _CheckFatal::~_CheckFatal()
@ 0x7fd141268202 Java_org_apache_mesos_state_AbstractState__1_1store_1get
@ 0x7fd1812ee308 (unknown)
./bin/start-chronos.bash: line 49: 8774 Aborted (core dumped) java
{code}
2. future->state is in an undefined state. The crash log is below.
{code}
F0917 14:41:17.899598 1868 check.hpp:79] Check failed: f.isReady()
2014-09-17 14:41:17,929 INFO [task_executor_thread-2] (ScheduledTask.scala:20) - Triggering: 'TEST_JOB2483'
2014-09-17 14:41:17,929 INFO [task_executor_thread-2] (TaskManager.scala:141) - Removing task mapping
*** Check failure stack trace: ***
@ 0x7f85b317bc6d google::LogMessage::Fail()
@ 0x7f85b317fc87 google::LogMessage::SendToLog()
@ 0x7f85b317db09 google::LogMessage::Flush()
@ 0x7f85b317de0d google::LogMessageFatal::~LogMessageFatal()
@ 0x7f85b30fab7b _checkReady<>()
@ 0x7f85b30f91cb Java_org_apache_mesos_state_AbstractState__1_1store_1get
@ 0x7f86092efd48 (unknown)
./bin/start-chronos.bash: line 49: 1743 Aborted (core dumped)
{code}
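Both failures abort the whole JVM because the JNI code reaches a fatal CHECK with a state it does not handle. As a rough sketch only (this is not the Mesos code and not a proposed patch), the self-contained model below maps every state, including PENDING and an out-of-range value, to an outcome that the JNI layer could turn into a Java exception instead of aborting. The names State, Outcome and resolve are hypothetical, and the choice of exception classes for the PENDING and unknown cases is an assumption. It would only contain the crash; it does not explain why the future is not READY after await().

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the states a future can report.
enum class State { PENDING, READY, FAILED, DISCARDED };

// Hypothetical result of the JNI call: either a value, or the name of a
// Java exception class to throw together with a message.
struct Outcome {
  bool ok;
  std::string exceptionClass;  // empty when ok is true
  std::string message;         // the value when ok, the error text otherwise
};

// Maps every state to an outcome once the wait has returned, so that no
// state falls through to a fatal check.
Outcome resolve(State state,
                const std::string& value,
                const std::string& failure) {
  switch (state) {
    case State::READY:
      return {true, "", value};
    case State::FAILED:
      return {false, "java/util/concurrent/ExecutionException", failure};
    case State::DISCARDED:
      return {false, "java/util/concurrent/CancellationException",
              "Future was discarded"};
    case State::PENDING:
      // Failure 1 above: the wait returned but the future is not ready.
      return {false, "java/util/concurrent/ExecutionException",
              "Future is still pending after await"};
  }
  // Failure 2 above: the state matches none of the defined values.
  return {false, "java/lang/IllegalStateException",
          "Future is in an unknown state"};
}
```

With this shape, the caller would throw the named exception whenever ok is false, so the framework sees a Java exception rather than a core dump.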
> the "store" component cause on-top framework (chronos) crash
> ------------------------------------------------------------
>
> Key: MESOS-1804
> URL: https://issues.apache.org/jira/browse/MESOS-1804
> Project: Mesos
> Issue Type: Bug
> Environment: mesos-0.19.0
> Reporter: Chengwei Yang
> Assignee: Chengwei Yang
>
> Chronos running with mesos-0.19.0 may crash as shown below.
> {code}
> [2014-09-05 15:21:36,095] INFO State J_chronos_job_34 does not exist yet. Adding to state (com.airbnb.scheduler.state.MesosStatePersistenceStore:146)
> F0905 15:21:36.175230 27727 org_apache_mesos_state_AbstractState.cpp:319] Check failed: future->isReady()
> *** Check failure stack trace: ***
> @ 0x7f4f1ecb199d google::LogMessage::Fail()
> @ 0x7f4f1ecb59b7 google::LogMessage::SendToLog()
> @ 0x7f4f1ecb3839 google::LogMessage::Flush()
> @ 0x7f4f1ecb3b3d google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4f1ec2ef90 Java_org_apache_mesos_state_AbstractState__1_1store_1get
> @ 0x7f4f18293d45 (unknown)
> Aborted (core dumped)
> {code}
> The related code snippet is below:
> {code}
> $ sed -ne '311,334p' src/java/jni/org_apache_mesos_state_AbstractState.cpp
> JNIEXPORT jobject JNICALL Java_org_apache_mesos_state_AbstractState__1_1store_1get
>   (JNIEnv* env, jobject thiz, jlong jfuture)
> {
>   Future<Option<Variable> >* future = (Future<Option<Variable> >*) jfuture;
>   future->await();
>   if (future->isFailed()) {
>     jclass clazz = env->FindClass("java/util/concurrent/ExecutionException");
>     env->ThrowNew(clazz, future->failure().c_str());
>     return NULL;
>   } else if (future->isDiscarded()) {
>     // TODO(benh): Consider throwing an ExecutionException since we
>     // never return true for 'isCancelled'.
>     jclass clazz = env->FindClass("java/util/concurrent/CancellationException");
>     env->ThrowNew(clazz, "Future was discarded");
>     return NULL;
>   }
>   CHECK_READY(*future);
>   if (future->get().isSome()) {
>     Variable* variable = new Variable(future->get().get());
> {code}
> The root cause seems to be that CHECK_READY(*future) failed, which crashed Chronos.
> See chronos issue: https://github.com/airbnb/chronos/issues/253
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)