You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@twill.apache.org by Erick Tryzelaar <er...@gmail.com> on 2014/05/10 02:44:28 UTC

Twill appears to be closing before Kafka has sent all the log output

Good evening,

I'm not quite sure if this is a bug in my code or Twill. I've been working
away on TWILL-78, but I'm running into some basic issues with me not
seeming to receive all the log messages from my containers. You can find a
copy of my toy application here:
https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
twill with 3 containers. Occasionally I get everything I expect, but
sometimes I only seem to get a subset of the log message on the worker
nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
the nodes have:

node1:
...
Launching main: public static void
org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
throws java.lang.Exception []
2014-05-10 00:20:09,334 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
entering barrier
2014-05-10 00:20:09,352 - WARN
[ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
no ConnectionStateListeners registered.
2014-05-10 00:20:11,187 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
barrier
2014-05-10 00:20:11,188 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:11,830 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
barrier
2014-05-10 00:20:11,831 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681206035-0
----

node2:
...
Launching main: public static void
org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
throws java.lang.Exception []
2014-05-10 00:20:00,133 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
entering barrier
2014-05-10 00:20:00,158 - WARN
[ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
no ConnectionStateListeners registered.
2014-05-10 00:20:02,161 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
barrier
2014-05-10 00:20:02,161 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:02,979 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
barrier
2014-05-10 00:20:02,979 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681196232-0
----

node3:
...
Launching main: public static void
org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
throws java.lang.Exception []
2014-05-10 00:20:01,191 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
entering barrier
2014-05-10 00:20:01,197 - WARN
[ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
no ConnectionStateListeners registered.
2014-05-10 00:20:02,768 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
barrier
2014-05-10 00:20:02,769 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
2014-05-10 00:20:03,587 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
barrier
2014-05-10 00:20:03,588 - ERROR
[ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
Main class completed.
Launcher completed
Cleanup directory tmp/twill.launcher-1399681197698-0
----

But the driver script in this case only shows the output from 2 nodes:

----
2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] - Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable
2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
before getting completion
2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
entering barrier
2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
entering barrier
2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
[ConnectionStateManager-0]
ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
There are no ConnectionStateListeners registered.
2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
barrier
2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
woken up
2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
of barrier
2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
[ConnectionStateManager-0]
ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
There are no ConnectionStateListeners registered.
2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
barrier
2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
woken up
2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
of barrier
2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
[ServiceDelegate]
GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
error: Connection reset by peer
2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
after shutting down
2014-05-09 17:20:13,312 - ERROR [Thread-3:o.l.g.t.GraphlabApplication$1@130]
- shutting down
---

So is this something I'm doing wrong? Or is Twill or Kafka somehow shutting
down before all the messages have been sent?

Thanks for any help,
-Erick

Re: Twill appears to be closing before Kafka has sent all the log output

Posted by Terence Yim <ch...@gmail.com>.
Hi Erick,

That sounds like bug to me that the Twill log appender is not doing
proper flush when the Runnable terminates. I've created a ticket for
the investigation and fix.

https://issues.apache.org/jira/browse/TWILL-80

Terence

On Thu, May 15, 2014 at 3:58 PM, Erick Tryzelaar
<er...@gmail.com> wrote:
> Hello Terence,
>
> Thanks for getting back to me. So it does seem to be some timing issue.
> I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41)
> to add some extra controls and a driver script. If I do a sleep of `0`, it
> will sometimes work, and sometimes exit before all the `"done"` messages
> have been sent. However, If I do a sleep of `10`, it seems like I get all
> the messages.
>
>
> On Mon, May 12, 2014 at 10:39 AM, Terence Yim <ch...@gmail.com> wrote:
>
>> Hi Erick,
>>
>> So all logs from node3 are entirely missing from the driver view?
>> That's sounds something wrong to me.
>>
>> In Twill, it'll always try to flush the last bit of logs when a
>> container shutdown, however if that flush failed, there would be no
>> retry, since we don't want to hang the stopping of application.
>>
>> However in your case, it seems like you get no log from a particular
>> node. Have you try to add couple seconds sleep in your Runnable.run()
>> before return and see if it is due to shutdown issue or something
>> else?
>>
>> Terence
>>
>> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
>> <er...@gmail.com> wrote:
>> > Good evening,
>> >
>> > I'm not quite sure if this is a bug in my code or Twill. I've been
>> working
>> > away on TWILL-78, but I'm running into some basic issues with me not
>> > seeming to receive all the log messages from my containers. You can find
>> a
>> > copy of my toy application here:
>> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
>> > twill with 3 containers. Occasionally I get everything I expect, but
>> > sometimes I only seem to get a subset of the log message on the worker
>> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
>> > the nodes have:
>> >
>> > node1:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:09,334 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:09,352 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:11,187 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:11,188 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:11,830 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:11,831 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681206035-0
>> > ----
>> >
>> > node2:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:00,133 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:00,158 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:02,161 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:02,161 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:02,979 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:02,979 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681196232-0
>> > ----
>> >
>> > node3:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:01,191 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:01,197 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:02,768 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:02,769 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:03,587 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:03,588 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681197698-0
>> > ----
>> >
>> > But the driver script in this case only shows the output from 2 nodes:
>> >
>> > ----
>> > 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] -
>> Unable
>> > to load native-hadoop library for your platform... using builtin-java
>> > classes where applicable
>> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
>> > before getting completion
>> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
>> > entering barrier
>> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
>> > entering barrier
>> > 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
>> > [ConnectionStateManager-0]
>> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
>> > There are no ConnectionStateListeners registered.
>> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
>> in
>> > barrier
>> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
>> > woken up
>> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
>> out
>> > of barrier
>> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
>> done
>> > 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
>> > [ConnectionStateManager-0]
>> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
>> > There are no ConnectionStateListeners registered.
>> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
>> in
>> > barrier
>> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
>> > woken up
>> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
>> out
>> > of barrier
>> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
>> done
>> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
>> > error: Connection reset by peer
>> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
>> > after shutting down
>> > 2014-05-09 17:20:13,312 - ERROR
>> [Thread-3:o.l.g.t.GraphlabApplication$1@130]
>> > - shutting down
>> > ---
>> >
>> > So is this something I'm doing wrong? Or is Twill or Kafka somehow
>> shutting
>> > down before all the messages have been sent?
>> >
>> > Thanks for any help,
>> > -Erick
>>

Re: Twill appears to be closing before Kafka has sent all the log output

Posted by Erick Tryzelaar <er...@gmail.com>.
Hello Terence,

Thanks for getting back to me. So it does seem to be some timing issue.
I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41)
to add some extra controls and a driver script. If I do a sleep of `0`, it
will sometimes work, and sometimes exit before all the `"done"` messages
have been sent. However, If I do a sleep of `10`, it seems like I get all
the messages.


On Mon, May 12, 2014 at 10:39 AM, Terence Yim <ch...@gmail.com> wrote:

> Hi Erick,
>
> So all logs from node3 are entirely missing from the driver view?
> That's sounds something wrong to me.
>
> In Twill, it'll always try to flush the last bit of logs when a
> container shutdown, however if that flush failed, there would be no
> retry, since we don't want to hang the stopping of application.
>
> However in your case, it seems like you get no log from a particular
> node. Have you try to add couple seconds sleep in your Runnable.run()
> before return and see if it is due to shutdown issue or something
> else?
>
> Terence
>
> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
> <er...@gmail.com> wrote:
> > Good evening,
> >
> > I'm not quite sure if this is a bug in my code or Twill. I've been
> working
> > away on TWILL-78, but I'm running into some basic issues with me not
> > seeming to receive all the log messages from my containers. You can find
> a
> > copy of my toy application here:
> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
> > twill with 3 containers. Occasionally I get everything I expect, but
> > sometimes I only seem to get a subset of the log message on the worker
> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
> > the nodes have:
> >
> > node1:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:09,334 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:09,352 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:11,187 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:11,188 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:11,830 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:11,831 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681206035-0
> > ----
> >
> > node2:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:00,133 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:00,158 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,161 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:02,161 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:02,979 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:02,979 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681196232-0
> > ----
> >
> > node3:
> > ...
> > Launching main: public static void
> >
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> > throws java.lang.Exception []
> > 2014-05-10 00:20:01,191 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> > entering barrier
> > 2014-05-10 00:20:01,197 - WARN
> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
> are
> > no ConnectionStateListeners registered.
> > 2014-05-10 00:20:02,768 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> > barrier
> > 2014-05-10 00:20:02,769 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
> woken up
> > 2014-05-10 00:20:03,587 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
> of
> > barrier
> > 2014-05-10 00:20:03,588 - ERROR
> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> > Main class completed.
> > Launcher completed
> > Cleanup directory tmp/twill.launcher-1399681197698-0
> > ----
> >
> > But the driver script in this case only shows the output from 2 nodes:
> >
> > ----
> > 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] -
> Unable
> > to load native-hadoop library for your platform... using builtin-java
> > classes where applicable
> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
> > before getting completion
> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> > entering barrier
> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> > entering barrier
> > 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
> > [ConnectionStateManager-0]
> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> > There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
> in
> > barrier
> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> > woken up
> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
> out
> > of barrier
> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
> done
> > 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
> > [ConnectionStateManager-0]
> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> > There are no ConnectionStateListeners registered.
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
> in
> > barrier
> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> > woken up
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
> out
> > of barrier
> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> > [ServiceDelegate]
> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
> done
> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
> > error: Connection reset by peer
> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
> > after shutting down
> > 2014-05-09 17:20:13,312 - ERROR
> [Thread-3:o.l.g.t.GraphlabApplication$1@130]
> > - shutting down
> > ---
> >
> > So is this something I'm doing wrong? Or is Twill or Kafka somehow
> shutting
> > down before all the messages have been sent?
> >
> > Thanks for any help,
> > -Erick
>

Re: Twill appears to be closing before Kafka has sent all the log output

Posted by Terence Yim <ch...@gmail.com>.
Hi Erick,

So all logs from node3 are entirely missing from the driver view?
That's sounds something wrong to me.

In Twill, it'll always try to flush the last bit of logs when a
container shutdown, however if that flush failed, there would be no
retry, since we don't want to hang the stopping of application.

However in your case, it seems like you get no log from a particular
node. Have you try to add couple seconds sleep in your Runnable.run()
before return and see if it is due to shutdown issue or something
else?

Terence

On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
<er...@gmail.com> wrote:
> Good evening,
>
> I'm not quite sure if this is a bug in my code or Twill. I've been working
> away on TWILL-78, but I'm running into some basic issues with me not
> seeming to receive all the log messages from my containers. You can find a
> copy of my toy application here:
> https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
> twill with 3 containers. Occasionally I get everything I expect, but
> sometimes I only seem to get a subset of the log message on the worker
> nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
> the nodes have:
>
> node1:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:09,334 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:09,352 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:11,187 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:11,188 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:11,830 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:11,831 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681206035-0
> ----
>
> node2:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:00,133 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:00,158 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:02,161 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:02,161 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:02,979 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:02,979 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681196232-0
> ----
>
> node3:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:01,191 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:01,197 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:02,768 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:02,769 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:03,587 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:03,588 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681197698-0
> ----
>
> But the driver script in this case only shows the output from 2 nodes:
>
> ----
> 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] - Unable
> to load native-hadoop library for your platform... using builtin-java
> classes where applicable
> 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
> before getting completion
> 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> entering barrier
> 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> entering barrier
> 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
> [ConnectionStateManager-0]
> ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> There are no ConnectionStateListeners registered.
> 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
> barrier
> 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> woken up
> 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
> of barrier
> 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
> [ConnectionStateManager-0]
> ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> There are no ConnectionStateListeners registered.
> 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
> barrier
> 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> woken up
> 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
> of barrier
> 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
> error: Connection reset by peer
> 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
> after shutting down
> 2014-05-09 17:20:13,312 - ERROR [Thread-3:o.l.g.t.GraphlabApplication$1@130]
> - shutting down
> ---
>
> So is this something I'm doing wrong? Or is Twill or Kafka somehow shutting
> down before all the messages have been sent?
>
> Thanks for any help,
> -Erick