Posted to dev@reef.apache.org by Boris Shulman <sh...@gmail.com> on 2016/05/19 06:35:06 UTC

Issue in REEF 0.14

While working on integrating REEF 0.14 we noticed the following issue:

On evaluator failure the driver shuts down:



WARNING: ExceptionEvent: local: /100.77.230.68:17237 remote: /100.77.210.34:56939 :: java.io.IOException: An existing connection was forcibly closed by the remote host

May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
WARNING: Failed evaluator: container_1462681171587_0057_01_000002
org.apache.reef.exception.EvaluatorException: Evaluator [container_1462681171587_0057_01_000002] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [FAILED]. This means that the Evaluator failed but wasn't able to send an error message back to the driver. Task [streamingNode0] was running when the Evaluator crashed.
        at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:589)
        at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:63)
        at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:36)
        at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:91)
        at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainerStatus(YarnContainerManager.java:391)
        at org.apache.reef.runtime.yarn.driver.YarnContainerManager.onContainersCompleted(YarnContainerManager.java:128)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:300)



May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluator
SEVERE: FailedEvaluator

May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluator
INFO: removing context streamingNode0 from job driver contexts.

May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.client.LoggingJobStatusHandler onNext
INFO: Received a JobStatus message that can't be sent:
identifier: "9b94916e-d860-4ca0-8ca8-4be412a70d47"
state: RUNNING
message: "Evaluator container_1462681171587_0057_01_000002 failed with message: Evaluator [container_1462681171587_0057_01_000002] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [FAILED]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver. Task [streamingNode0] was running when the Evaluator crashed."

May 19, 2016 6:00:28 AM org.apache.reef.javabridge.generic.JobDriver handleFailedEvaluatorInCLR
INFO: CLR FailedEvaluator handler set, handling things with CLR handler.

May 19, 2016 6:00:28 AM org.apache.reef.runtime.common.driver.idle.DriverIdleManager onPotentiallyIdle
INFO: All components indicated idle. Initiating Driver shutdown.
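
For anyone triaging similar driver logs, the EvaluatorException message in the log above has a regular shape: an evaluator ID followed by the state the driver assumed and the state the resource manager reported. A minimal sketch of pulling those three fields out is shown here. The class name EvaluatorStateMismatch and its fields are hypothetical helpers, not part of REEF, and the pattern is based only on the wording seen in this log, so it may not match other REEF versions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log-triage helper; not part of REEF. */
final class EvaluatorStateMismatch {
    // Pattern mirrors the wording of the EvaluatorException message quoted in this thread.
    private static final Pattern MESSAGE = Pattern.compile(
        "Evaluator \\[(\\S+?)\\] is assumed to be in state \\[(\\w+)\\]\\. "
            + "But the resource manager reports it to be in state \\[(\\w+)\\]");

    final String evaluatorId;
    final String assumedState;
    final String reportedState;

    private EvaluatorStateMismatch(String evaluatorId, String assumedState, String reportedState) {
        this.evaluatorId = evaluatorId;
        this.assumedState = assumedState;
        this.reportedState = reportedState;
    }

    /** Returns the first state mismatch found in the given log text, if any. */
    static Optional<EvaluatorStateMismatch> parse(String logText) {
        // Mail clients hard-wrap long log lines, so collapse all whitespace runs first.
        Matcher m = MESSAGE.matcher(logText.replaceAll("\\s+", " "));
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new EvaluatorStateMismatch(m.group(1), m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        String sample = "org.apache.reef.exception.EvaluatorException: Evaluator\n"
            + "[container_1462681171587_0057_01_000002] is assumed to be in state\n"
            + "[RUNNING]. But the resource manager reports it to be in state [FAILED].";
        parse(sample).ifPresent(m -> System.out.println(
            m.evaluatorId + ": driver assumed " + m.assumedState
                + ", resource manager reported " + m.reportedState));
    }
}
```

Collapsing whitespace before matching is a deliberate choice so that the same pattern works on both the raw log and a mail-wrapped copy of it.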





I do have a Failed Evaluator handler, and I submit a new request:



INFO: +Java_org_apache_reef_javabridge_NativeInterop_clrSystemFailedEvaluatorHandlerOnNext
<C++> Start: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
START: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
<C++> Stop: 0 : 2016-05-19T06:00:28.2729926+00:00 0016
EXIT: FailedEvaluatorClr2Java::FailedEvaluatorClr2Java
Org.Apache.REEF.Driver.Bridge.ClrSystemHandlerWrapper Start: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
START: 5/19/2016 6:00:28 AM ClrSystemHandlerWrapper::Call_ClrSystemFailedEvaluator_OnNext
<C++> Information: 0 : 2016-05-19T06:00:28.2886707+00:00 0016
INFO: FailedEvaluatorClr2Java::GetId
<C++> Start: 0 : 2016-05-19T06:00:28.8042796+00:00 0016
START: EvaluatorRequestorClr2Java::Submit
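
The thread does not state the root cause, so the following is only an illustration of one way the sequence in these logs could arise: the CLR handler starts at 06:00:28.2886707 and EvaluatorRequestorClr2Java::Submit is logged at 06:00:28.8042796, while the driver's idle check is logged within the same second. If an idle check runs after the failed evaluator has been removed but before the replacement request is registered, it sees nothing outstanding and starts shutdown. The toy model below is a sketch under that assumption. IdleOrderingSketch and all of its members are hypothetical names and do not reflect the actual DriverIdleManager implementation.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model only: names are hypothetical and this is not REEF's DriverIdleManager.
 * It shows how the relative order of "idle check" and "resubmit" changes the outcome.
 */
final class IdleOrderingSketch {
    private int runningEvaluators;
    private int pendingRequests;
    private boolean shutdownInitiated;
    private final List<String> events = new ArrayList<>();

    IdleOrderingSketch(int runningEvaluators) {
        this.runningEvaluators = runningEvaluators;
    }

    /** Registers a new evaluator request, as a failure handler might do. */
    void submitRequest() {
        pendingRequests++;
        events.add("request submitted");
    }

    /** Initiates shutdown when nothing is running and nothing is pending. */
    void checkIdle() {
        if (!shutdownInitiated && runningEvaluators == 0 && pendingRequests == 0) {
            shutdownInitiated = true;
            events.add("shutdown initiated");
        }
    }

    /** Simulates an evaluator failure with the two possible orderings. */
    void onEvaluatorFailed(boolean resubmitBeforeIdleCheck) {
        runningEvaluators--;
        events.add("evaluator failed");
        if (resubmitBeforeIdleCheck) {
            submitRequest();
            checkIdle();
        } else {
            checkIdle();
            submitRequest(); // arrives too late to prevent shutdown
        }
    }

    List<String> events() {
        return events;
    }

    /** Runs one failure on a driver with a single evaluator and reports whether it shut down. */
    static boolean shutsDown(boolean resubmitBeforeIdleCheck) {
        IdleOrderingSketch driver = new IdleOrderingSketch(1);
        driver.onEvaluatorFailed(resubmitBeforeIdleCheck);
        return driver.shutdownInitiated;
    }

    public static void main(String[] args) {
        System.out.println("idle check first -> shutdown: " + shutsDown(false));
        System.out.println("resubmit first   -> shutdown: " + shutsDown(true));
    }
}
```

In this model the driver survives only when the replacement request is counted before idleness is evaluated. Whether REEF 0.14 behaves this way is for REEF-1393 to confirm.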





Boris.

Re: Issue in REEF 0.14

Posted by Andrew Chung <af...@gmail.com>.
The CR is out and pending merge.

Thanks,
Andrew

On Thu, May 19, 2016 at 1:55 PM, Boris Shulman <sh...@gmail.com> wrote:

> Thanks. Found it already. Do we have an ETA for REEF-1393
> <https://issues.apache.org/jira/browse/REEF-1393>? It is blocking REEF
> 0.15. Also any idea why we did not see it in REEF 0.12 and earlier (did not
> try 0.13)?

Re: Issue in REEF 0.14

Posted by Boris Shulman <sh...@gmail.com>.
Thanks. Found it already. Do we have an ETA for REEF-1393
<https://issues.apache.org/jira/browse/REEF-1393>? It is blocking REEF
0.15. Also, any idea why we did not see it in REEF 0.12 and earlier (we did
not try 0.13)?

On Thu, May 19, 2016 at 1:20 PM, Andrew Chung <af...@gmail.com> wrote:

> Hi Boris,
>
> This is the issue noted in REEF-1393[0]. Email threads[1][2].
>
> Thanks,
> Andrew
>
> [0]: https://issues.apache.org/jira/browse/REEF-1393
> [1]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZSxaVA-xyRg2w7US=vzzvY+Qo03psXmw72=xJphJ1gdkg@mail.gmail.com%3E
> [2]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZQuYHaxpGX--eOst0WwT8dsJN7FoOSpL=pw3F0NQUZ+6Q@mail.gmail.com%3E

Re: Issue in REEF 0.14

Posted by Andrew Chung <af...@gmail.com>.
Hi Boris,

This is the issue noted in REEF-1393[0]. Related email threads: [1][2].

Thanks,
Andrew

[0]: https://issues.apache.org/jira/browse/REEF-1393
[1]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZSxaVA-xyRg2w7US=vzzvY+Qo03psXmw72=xJphJ1gdkg@mail.gmail.com%3E
[2]: https://mail-archives.apache.org/mod_mbox/reef-dev/201605.mbox/%3CCAHzzxZQuYHaxpGX--eOst0WwT8dsJN7FoOSpL=pw3F0NQUZ+6Q@mail.gmail.com%3E
