
Posted to proton@qpid.apache.org by Michael Goulish <mg...@redhat.com> on 2013/02/19 13:40:16 UTC

the killer node

Well, it looks like one of my nodes can kill the other one by doing a put.
No errors reported by either messenger before the fatality.

I'd like to see if someone else can confirm this result,
and maybe see something that I am not seeing.

Compile and run scripts are provided in the directory, called "node".


I am testing this against unpatched 0.4 RC1 code.  ( The result was the same
with Ken's recent patch for infinite credit. )


  1. Two instances of one program are used.  Node A only receives, 
     Node B only sends to it.

  2. Start node A first, with the script "r1".  
     It will go through its main loop, trying to receive
     and timing out, for as long as you like.


  3. Start node B, with script r2.
     It will pause after formatting its first message, and will
     then do a dramatic 5-second countdown.  Then it calls 
     put  ( not send! )  and node *A* dies horribly, its core
     file spattering the hard disk.

     Node B is unaware of the carnage it has caused, sedated
     by a sleep loop, tragically still expecting to call send
     and start talking to its partner, node A.


( see attached -- if you dare. )

Re: the killer node

Posted by Michael Goulish <mg...@redhat.com>.

Oh it has to work then .... testing ...  and it does.


But I do get this unusual compiler warning:


  warning: ISO C90 forbids fools and madmen to program in this language.  Go learn Haskell and leave me alone.


huh.





----- Original Message -----
From: "Darryl L. Pierce" <dp...@redhat.com>
To: proton@qpid.apache.org
Sent: Tuesday, February 19, 2013 1:37:19 PM
Subject: Re: the killer node

On Tue, Feb 19, 2013 at 11:43:52AM -0500, Michael Goulish wrote:
> 
>   This just in.
> 
>   It's a linking issue.
> 
>   When I changed my two fn names from send() to my_send() 
>   and from recv() to my_recv() ... no more problem.
> 
>   Different behavior on Fedora 17 and Fedora 18.
> 
>   Gulp.
> 
>   I will post more if I learn something useful.

Just for grins, what happens if you set the name back and make it
static?

-- 
Darryl L. Pierce, Sr. Software Engineer @ Red Hat, Inc.
Delivering value year after year.
Red Hat ranks #1 in value among software vendors.
http://www.redhat.com/promo/vendor/

Re: the killer node

Posted by "Darryl L. Pierce" <dp...@redhat.com>.

On Tue, Feb 19, 2013 at 11:43:52AM -0500, Michael Goulish wrote:
> 
>   This just in.
> 
>   It's a linking issue.
> 
>   When I changed my two fn names from send() to my_send() 
>   and from recv() to my_recv() ... no more problem.
> 
>   Different behavior on Fedora 17 and Fedora 18.
> 
>   Gulp.
> 
>   I will post more if I learn something useful.

Just for grins, what happens if you set the name back and make it
static?
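
A minimal sketch of what the static variant might look like, assuming
simplified helper signatures (the real node.c functions also take a
messenger and a message, and run_node is a hypothetical wrapper added
only for illustration). Marking the helpers static gives them internal
linkage, so the executable does not export them and a shared library
cannot bind to them in place of the libc send() and recv(). This only
compiles as long as the file does not include <sys/socket.h>, whose
non-static declarations of send() and recv() would conflict.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified, hypothetical stand-ins for the helpers in node.c.
 * `static` gives each function internal linkage, so the names stay
 * private to this translation unit. */
static int send(const char *name, const char *addr)
{
    assert(name != NULL && addr != NULL);
    /* printf returns the number of characters written. */
    return printf("%s: sending to %s\n", name, addr);
}

static int recv(const char *name, const char *addr)
{
    assert(name != NULL && addr != NULL);
    return printf("%s: receiving on %s\n", name, addr);
}

/* Hypothetical entry point: dispatch to one of the private helpers. */
int run_node(const char *name, const char *addr, int is_sender)
{
    return is_sender ? send(name, addr) : recv(name, addr);
}
```

Calling run_node("B", "amqp://0.0.0.0:6666", 1) from main would
exercise the sender path without exposing a global send symbol.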

-- 
Darryl L. Pierce, Sr. Software Engineer @ Red Hat, Inc.
Delivering value year after year.
Red Hat ranks #1 in value among software vendors.
http://www.redhat.com/promo/vendor/

Re: the killer node

Posted by Rafael Schloming <rh...@alum.mit.edu>.

That's almost the same stack trace I see with send when I comment out the
while (1). The only difference is that it's all under pn_messenger_send
rather than pn_messenger_recv.

This looks to me like the stack is getting corrupted since send is actually
your code yet the trace appears to be claiming that proton is calling into
it which it couldn't possibly do. I'm guessing the whole stack underneath
pn_connector_process (or above it in the trace below) is garbage. Can you
try running under valgrind and see if it spots where the corruption is
happening?

As an aside you should probably also build with debug on as it will be a
little clearer what is going on.

--Rafael

On Tue, Feb 19, 2013 at 7:08 AM, Michael Goulish <mg...@redhat.com> wrote:

> Sorry, I meant to include that.
>
> Here is the stack trace from node A :
>
>
> #0  0x00007fbb74173de8 in vfprintf () from /lib64/libc.so.6
> #1  0x00007fbb74177abf in buffered_vfprintf () from /lib64/libc.so.6
> #2  0x00007fbb74172c1e in vfprintf () from /lib64/libc.so.6
> #3  0x00007fbb7417cd87 in fprintf () from /lib64/libc.so.6
> #4  0x0000000000400f40 in send (name=0x6 <Address 0x6 out of bounds>,
>     messenger=0x149a150, message=0x51,
>     addr=0x4000 <Address 0x4000 out of bounds>) at node.c:44
> #5  0x00007fbb7450f524 in pn_send () from /lib/libqpid-proton.so.1
> #6  0x00007fbb74510883 in pn_connector_process () from /lib/libqpid-proton.so.1
> #7  0x00007fbb7450d85a in pn_messenger_tsync () from /lib/libqpid-proton.so.1
> #8  0x00007fbb7450d961 in pn_messenger_sync () from /lib/libqpid-proton.so.1
> #9  0x00007fbb7450ef6d in pn_messenger_recv () from /lib/libqpid-proton.so.1
> #10 0x0000000000401079 in recv (name=0x7fff2f9a5363 "A", messenger=0x1493970,
>     message=0x148e010, addr=0x7fff2f9a4360 "amqp://~0.0.0.0:6666") at node.c:88
> #11 0x00000000004014e2 in main (argc=3, argv=0x7fff2f9a4888) at node.c:194
>
>
>
>
> If you like I can give you access to my machine.

Re: the killer node

Posted by Michael Goulish <mg...@redhat.com>.

Sorry, I meant to include that.

Here is the stack trace from node A :


#0  0x00007fbb74173de8 in vfprintf () from /lib64/libc.so.6
#1  0x00007fbb74177abf in buffered_vfprintf () from /lib64/libc.so.6
#2  0x00007fbb74172c1e in vfprintf () from /lib64/libc.so.6
#3  0x00007fbb7417cd87 in fprintf () from /lib64/libc.so.6
#4  0x0000000000400f40 in send (name=0x6 <Address 0x6 out of bounds>, 
    messenger=0x149a150, message=0x51, 
    addr=0x4000 <Address 0x4000 out of bounds>) at node.c:44
#5  0x00007fbb7450f524 in pn_send () from /lib/libqpid-proton.so.1
#6  0x00007fbb74510883 in pn_connector_process () from /lib/libqpid-proton.so.1
#7  0x00007fbb7450d85a in pn_messenger_tsync () from /lib/libqpid-proton.so.1
#8  0x00007fbb7450d961 in pn_messenger_sync () from /lib/libqpid-proton.so.1
#9  0x00007fbb7450ef6d in pn_messenger_recv () from /lib/libqpid-proton.so.1
#10 0x0000000000401079 in recv (name=0x7fff2f9a5363 "A", messenger=0x1493970, 
    message=0x148e010, addr=0x7fff2f9a4360 "amqp://~0.0.0.0:6666") at node.c:88
#11 0x00000000004014e2 in main (argc=3, argv=0x7fff2f9a4888) at node.c:194




If you like I can give you access to my machine.






----- Original Message -----
From: "Rafael Schloming" <rh...@alum.mit.edu>
To: proton@qpid.apache.org
Sent: Tuesday, February 19, 2013 9:33:29 AM
Subject: Re: the killer node

This doesn't happen for me. I see node B loop forever and never send
anything which is what I would expect given the while (1) { sleep(...); }
you have in there. What does your debugger say about where node A crashes?

--Rafael


Re: the killer node

Posted by Rafael Schloming <rh...@alum.mit.edu>.

This doesn't happen for me. I see node B loop forever and never send
anything which is what I would expect given the while (1) { sleep(...); }
you have in there. What does your debugger say about where node A crashes?

--Rafael


Re: the killer node

Posted by Michael Goulish <mg...@redhat.com>.

Sorry for scaring you!

Final update is -- don't use global names in your C app that 
look like libc names!  Or make them static.

Duh.

It's a little bit of a mystery as to why other testers did not
see the same issue, but -- probably nothing Earth-shattering 
here.




----- Original Message -----
From: "Rafael Schloming" <rh...@alum.mit.edu>
To: proton@qpid.apache.org
Sent: Tuesday, February 19, 2013 12:15:55 PM
Subject: Re: the killer node

Doh!

You had me scared there for a while.

--Rafael


Re: the killer node

Posted by Rafael Schloming <rh...@alum.mit.edu>.

Doh!

You had me scared there for a while.

--Rafael

On Tue, Feb 19, 2013 at 8:43 AM, Michael Goulish <mg...@redhat.com> wrote:

>
>   This just in.
>
>   It's a linking issue.
>
>   When I changed my two fn names from send() to my_send()
>   and from recv() to my_recv() ... no more problem.
>
>   Different behavior on Fedora 17 and Fedora 18.
>
>   Gulp.
>
>   I will post more if I learn something useful.

Re: the killer node

Posted by Michael Goulish <mg...@redhat.com>.

  This just in.

  It's a linking issue.

  When I changed my two fn names from send() to my_send() 
  and from recv() to my_recv() ... no more problem.

  Different behavior on Fedora 17 and Fedora 18.

  Gulp.

  I will post more if I learn something useful.
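
The rename described above might look like the following sketch. The
signatures are simplified assumptions (the stack trace elsewhere in
this thread shows the real helpers also taking a messenger and a
message); only the names my_send and my_recv come from the thread.
Because neither name matches a libc function, the helpers can keep
external linkage without shadowing send() or recv() for libraries
loaded into the same process.

```c
#include <assert.h>
#include <stdio.h>

/* Renamed helper: `my_send` cannot be mistaken for the socket call
 * send() by the dynamic linker. The body is a hypothetical stub. */
int my_send(const char *name, const char *addr)
{
    assert(name != NULL && addr != NULL);
    /* printf returns the number of characters written. */
    return printf("%s: sending to %s\n", name, addr);
}

/* Renamed helper: same idea for recv(). */
int my_recv(const char *name, const char *addr)
{
    assert(name != NULL && addr != NULL);
    return printf("%s: receiving on %s\n", name, addr);
}
```

A project-specific prefix on every global function is a cheap way to
stay clear of names that libc or another library already exports.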






