Posted to proton@qpid.apache.org by Michael Goulish <mg...@redhat.com> on 2015/02/25 13:54:49 UTC

I think that's a blocker...

...but if not, somebody please feel free to correct me.

The Jira that I just created -- PROTON-826 -- is for a 
bug I found with my topology testing of the Dispatch Router,
in which I repeatedly kill and restart a router and make 
sure that the router network comes back to the same topology 
that it had before.

As of checkin 01cb00c -- which had no Jira -- it is pretty
easy for my test to blow core.  It looks like an error 
string is being double-freed (maybe) in the proton library.

( full info in the Jira.  https://issues.apache.org/jira/browse/PROTON-826 )



Re: I think that's a blocker...

Posted by Cliff Jansen <cl...@gmail.com>.
The Proton IOCP implementation provides a reactor capability over the
proactor API, which mostly works as you expect.

I would remind folks of the following Windows-isms compared to POSIX:

Windows was late to the tcp/ip party (championing LAN Manager instead),
and third parties provided tcp/ip stacks for a long time.  A socket was
an alien entity that was supposed to use different APIs from other file
handles.  This persists to this day and is why you can't mix a Windows
pipe and a socket in a single Windows selection mechanism.  Hence the
documented restrictions on pn_pipe().

Well, you can mix them in IOCP, because that came along after Microsoft
was in the tcp/ip game.  But that has other consequences: a socket
doesn't have fluid usage in IOCP, since it can be bound to a completion
port only once.  Once "selected" within Proton, it cannot be used in
another IOCP context elsewhere.
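
To make the bind-once point concrete, here is a minimal sketch of the
constraint (illustrative only -- these are real Win32 calls, but this
is not code from the Proton tree):

#include <winsock2.h>
#include <windows.h>

/* A socket can be associated with exactly one completion port, exactly
 * once.  The second association below fails, which is why a socket
 * that has been "selected" by one pn_io_t/pn_selector_t on Windows can
 * never move to another. */
void bind_once_demo(SOCKET s, HANDLE iocp_a, HANDLE iocp_b)
{
    /* First association succeeds; s now belongs to iocp_a for life. */
    CreateIoCompletionPort((HANDLE)s, iocp_a, (ULONG_PTR)s, 0);

    /* Second association fails; there is no way to unbind a socket
     * short of closesocket(). */
    if (CreateIoCompletionPort((HANDLE)s, iocp_b, (ULONG_PTR)s, 0) == NULL) {
        /* expected: the socket is already bound to iocp_a */
    }
}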

Proton can
  support external loops
  provide its own loop
  allow applications to completely manage IO themselves (e.g. qpid-cpp)

Proton appears to be close to supporting Dispatch's level of concurrency.

Is the above enough, or do we have to find extensions to these that
are useful to developers without stretching the differences between
Windows and POSIX past what we can bridge?

On Thu, Feb 26, 2015 at 12:22 PM, Andrew Stitcher <as...@redhat.com> wrote:
> On Thu, 2015-02-26 at 15:09 -0500, Rafael Schloming wrote:
>> ...
>> It sounds like one way or another we need at least some design changes. I
>> don't think it's workable to have overlapping/close but distinct semantics
>> for the API on different platforms (e.g. you can move sockets on one
>> platform but not on another). I'm starting to think we either need one
>> platform to precisely and fully emulate the semantics of the other
>> platform, or they both need to implement some interface that is slightly
>> higher level and can better accommodate the differences.
>
> I may have misremembered, but I think the essential platform difference
> here is that Windows IOCP really tries to implement a Proactor type of
> pattern rather than the reactor type of pattern that we are using in
> Proton.
>
> [Proactor is where fundamentally the system calls back a processing
> function on a thread that you give it]
>
> In the qpid implementation I had to effectively make the exposed
> interface a proactor type interface in all platforms, to bridge the gap.
>
> I'm not sure this is workable in a context where the user can supply
> their own event loop.
>
> Andrew
>
>

Re: I think that's a blocker...

Posted by Andrew Stitcher <as...@redhat.com>.
On Thu, 2015-02-26 at 15:09 -0500, Rafael Schloming wrote:
> ...
> It sounds like one way or another we need at least some design changes. I
> don't think it's workable to have overlapping/close but distinct semantics
> for the API on different platforms (e.g. you can move sockets on one
> platform but not on another). I'm starting to think we either need one
> platform to precisely and fully emulate the semantics of the other
> platform, or they both need to implement some interface that is slightly
> higher level and can better accommodate the differences.

I may have misremembered, but I think the essential platform difference
here is that Windows IOCP really tries to implement a Proactor type of
pattern rather than the reactor type of pattern that we are using in
Proton.

[Proactor is where fundamentally the system calls back a processing
function on a thread that you give it]
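
A toy sketch of the difference in shape (hypothetical names only,
nothing Proton-specific; the point is which side owns the thread that
does the work):

#include <stdio.h>

typedef int socket_t;   /* hypothetical stand-in */

/* Reactor: the system says a socket is READY; the application's own
 * thread then performs the read itself (the select/poll/epoll shape). */
static void reactor_on_ready(socket_t s)
{
    printf("socket %d readable; app thread now calls recv() itself\n", s);
}

/* Proactor: the application hands over the operation and a callback up
 * front; the system does the read and calls back, on a thread it
 * chose, once the operation is already COMPLETE (the IOCP shape). */
static void proactor_on_complete(socket_t s, long nread)
{
    printf("read of %ld bytes on socket %d already done; app reacts\n",
           nread, s);
}

int main(void)
{
    reactor_on_ready(3);            /* readiness notification */
    proactor_on_complete(3, 512);   /* completion notification */
    return 0;
}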

In the qpid implementation I had to effectively make the exposed
interface a proactor-type interface on all platforms, to bridge the gap.

I'm not sure this is workable in a context where the user can supply
their own event loop.

Andrew



Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Wed, Feb 25, 2015 at 9:23 PM, Cliff Jansen <cl...@gmail.com> wrote:

> Two use cases, very desirable in Proton, happen to be mostly trivial
> on POSIX but mostly non-trivial on Windows: multi-threading and
> external loops.  Proton io and selector classes are POSIX-y in look
> and feel, but IO completion ports are very different and can mimic the
> POSIX functionality only so far.  The documented restrictions I wrote
> (PROTON-668) try to curb expectations for cross-platform
> functionality (while still allowing use cases like Dispatch).
>
> This, however, is not a cross platform issue.  This particular problem
> is confined to user space and affects all platforms equally.
> Presumably the API needs fixing or we agree to take a step backwards.
>

There is a pretty severe limitation without this fix, since there is no
way for the API to communicate socket errors back to the user. Also,
even if we roll back the fix, the wouldblock flag is accessed with
exactly the same pattern as the error slot. I would assume this means
there is still an issue even with the fix rolled back: two different
threads could overwrite the wouldblock flag, and one or both of them
could read back the wrong value, producing a hang or a spurious error.
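
A minimal sketch of that race, with stand-in types rather than the real
Proton structs: two threads share one io object, so the flag written by
one call can be read back by the other.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the single pn_io_t both threads share. */
typedef struct {
    bool wouldblock;
} fake_io_t;

static fake_io_t shared_io;

static void *sender(void *arg)
{
    bool mine = (arg != NULL);
    shared_io.wouldblock = mine;  /* my "send" records its outcome... */
    /* ...but the other thread may overwrite the slot before I read it
     * back, so I can act on the wrong value (hang, or bogus error). */
    printf("wrote wouldblock=%d, read back %d\n",
           (int)mine, (int)shared_io.wouldblock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, sender, (void *)1);
    pthread_create(&b, NULL, sender, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}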

> Note that Bozo previously pointed out (Proton mailing list) that the
> pn_io_t API had threading inconsistencies with pn_io_error() and
> pn_io_wouldblock().  Perhaps a pn_selectable_t should be passed in
> instead of a socket parameter, or proton should maintain a map of
> errors and wouldblock state for sockets until they are pn_close'd (as
> the Windows code already does for completion port descriptor state).
> The former would be more consistent with proton overall, the latter
> would require some user space locking.  Another possibility could be
> to pass in pointers to error/wouldblock state as part of the io call.
>

Passing in specific pointers to error and/or wouldblock state seems
less flexible than passing in a pointer to something more abstract that
can hold not only the error state but also whatever other thread-local
state makes sense. My assumption is/was that pn_io_t provides this
context.


> While I still think this is not a Windows issue, and the documentation
> is supposed to reflect the Dispatch pattern and not handcuff it, here
> is more about the pn_io_t implementation:
>
>   Global state is in struct iocp_t, per socket state is in struct
> iocpdesc_t
>   (see iocp.h, the OS looks after this stuff in POSIX)
>
>   It has to set up and tear down the Windows socket system.
>   It has to "know" if the application is using an external loop or
> pn_selector_select
>   Set up the completion port subsystem (unless external loop)
>   It has to find IOCP state for each passed in socket
>   Manage multiple concurrent io operations per socket (N writes + 1 read)
>   Notify PN_READABLE, PN_WRITABLE, etc changes to the selector (if
> any) for each call
>   Do an elaborate tear down on pn_io_free (drain completions, force
> close dangling sockets)
>
> Regarding the documentation, I looked at Dispatch, which had been
> using Proton in a multi-threaded manner for some time with
> considerable success.  The old driver.c (now deprecated) allowed
> simultaneous threads to do
>
>   pn_connector_process()  <- but no two threads sharing a
> connector/transport/socket
>   pn_driver_wait_n() combined with pn_connector_next() <- only one
> thread waiting/choosing at a time
>   pn_driver_wakeup() <- any thread, any time, to unstick things
>   everything else (listen accept connect) considered non-thread safe
>
> which provides plenty of parallelism if you have more than one connection.
>
> The documentation I wrote tried to say that you could do that much on
> any platform, but no more (without risking undefined behavior).
> Things (and the documentation) get further complicated by supporting
> external loops, which prevents the use of IO completion ports
> completely for a given pn_io_t and uses a different set of system
> calls.
>
> Perhaps the doc restrictions could be summarized as:
>
>   One pn_io_t/pn_selector_t, one thread -> no restrictions
>   One pn_io_t/pn_selector_t, multi threads -> limited thread safety
> (Dispatch)
>   One pn_io_t, no pn_selector_t, external loop, one thread -> no
> restrictions
>   One pn_io_t, no selector, external loop, multi threads -> ???
>   multiple pn_io_t: doable, but sockets must stick to one pn_io_t
>
>
> Some difficulties you might not expect: Linux doesn't care if sockets
> move between selectors, or if one thread is reading on a socket while
> another is writing to the same one.  Simple things like this would
> have major design and performance implications for Proton on Windows.
>

It sounds like one way or another we need at least some design changes. I
don't think it's workable to have overlapping/close but distinct semantics
for the API on different platforms (e.g. you can move sockets on one
platform but not on another). I'm starting to think we either need one
platform to precisely and fully emulate the semantics of the other
platform, or they both need to implement some interface that is slightly
higher level and can better accommodate the differences.

--Rafael

Re: I think that's a blocker...

Posted by Cliff Jansen <cl...@gmail.com>.
Two use cases, very desirable in Proton, happen to be mostly trivial
on POSIX but mostly non-trivial on Windows: multi-threading and
external loops.  Proton io and selector classes are POSIX-y in look
and feel, but IO completion ports are very different and can mimic the
POSIX functionality only so far.  The documented restrictions I wrote
(PROTON-668) try to curb expectations for cross-platform
functionality (while still allowing use cases like Dispatch).

This, however, is not a cross platform issue.  This particular problem
is confined to user space and affects all platforms equally.
Presumably the API needs fixing or we agree to take a step backwards.

Note that Bozo previously pointed out (Proton mailing list) that the
pn_io_t API had threading inconsistencies with pn_io_error() and
pn_io_wouldblock().  Perhaps a pn_selectable_t should be passed in
instead of a socket parameter, or proton should maintain a map of
errors and wouldblock state for sockets until they are pn_close'd (as
the Windows code already does for completion port descriptor state).
The former would be more consistent with proton overall, the latter
would require some user space locking.  Another possibility could be
to pass in pointers to error/wouldblock state as part of the io call.
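
A rough sketch of the last option only (hypothetical signature, not the
existing API), wrapping plain POSIX send() to show the shape -- each
call writes its outcome into caller-owned storage, so nothing is shared
between threads:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-call result, replacing the slots in the shared
 * pn_io_t. */
typedef struct {
    int  error_code;   /* errno from a failed call, else 0 */
    bool wouldblock;
} pn_io_result_t;

/* Hypothetical variant of pn_send(); POSIX-only sketch of the shape. */
ssize_t pn_send_r(int socket, const void *buf, size_t size,
                  pn_io_result_t *result)
{
    ssize_t n = send(socket, buf, size, 0);
    result->error_code = (n < 0) ? errno : 0;
    result->wouldblock = (n < 0 &&
                          (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}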

While I still think this is not a Windows issue, and the documentation
is supposed to reflect the Dispatch pattern and not handcuff it, here
is more about the pn_io_t implementation:

  Global state is in struct iocp_t, per socket state is in struct iocpdesc_t
  (see iocp.h, the OS looks after this stuff in POSIX)

  It has to set up and tear down the Windows socket system.
  It has to "know" if the application is using an external loop or
pn_selector_select
  Set up the completion port subsystem (unless external loop)
  It has to find IOCP state for each passed in socket
  Manage multiple concurrent io operations per socket (N writes + 1 read)
  Notify PN_READABLE, PN_WRITABLE, etc changes to the selector (if
any) for each call
  Do an elaborate tear down on pn_io_free (drain completions, force
close dangling sockets)
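
Very roughly, that per-socket bookkeeping has the following kind of
shape (an approximation written for illustration; the real structs are
in iocp.h and differ in detail):

#include <winsock2.h>
#include <windows.h>

typedef struct iocpdesc_t {
    SOCKET  socket;            /* bound to one completion port, for life */
    HANDLE  completion_port;   /* unused when an external loop is used   */
    int     ops_in_flight;     /* N outstanding writes + 1 pending read  */
    int     events;            /* PN_READABLE/PN_WRITABLE to report      */
    int     error;             /* kept per socket until pn_close()       */
    struct iocpdesc_t *next;   /* chaining for the socket -> state map   */
} iocpdesc_t;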

Regarding the documentation, I looked at Dispatch, which had been
using Proton in a multi-threaded manner for some time with
considerable success.  The old driver.c (now deprecated) allowed
simultaneous threads to do

  pn_connector_process()  <- but no two threads sharing a
connector/transport/socket
  pn_driver_wait_n() combined with pn_connector_next() <- only one
thread waiting/choosing at a time
  pn_driver_wakeup() <- any thread, any time, to unstick things
  everything else (listen accept connect) considered non-thread safe

which provides plenty of parallelism if you have more than one connection.
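
In code, that pattern looks roughly like this (names as above, from the
old driver.c; exact signatures approximate, and the real Dispatch code
gated the waiter with a condition variable rather than a bare mutex):

#include <pthread.h>
#include <proton/driver.h>   /* the old, now-deprecated driver API */

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    pn_driver_t *driver = (pn_driver_t *)arg;
    for (;;) {
        /* Only one thread at a time waits and chooses... */
        pthread_mutex_lock(&wait_lock);
        pn_driver_wait_n(driver, -1);
        pn_connector_t *c = pn_connector_next(driver);
        pthread_mutex_unlock(&wait_lock);

        /* ...but connectors are processed in parallel, provided no two
         * threads ever share one connector/transport/socket. */
        if (c)
            pn_connector_process(c);
    }
    return NULL;
}
/* pn_driver_wakeup(driver) may be called from any thread, any time. */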

The documentation I wrote tried to say that you could do that much on
any platform, but no more (without risking undefined behavior).
Things (and the documentation) get further complicated by supporting
external loops, which prevents the use of IO completion ports
completely for a given pn_io_t and uses a different set of system
calls.

Perhaps the doc restrictions could be summarized as:

  One pn_io_t/pn_selector_t, one thread -> no restrictions
  One pn_io_t/pn_selector_t, multi threads -> limited thread safety (Dispatch)
  One pn_io_t, no pn_selector_t, external loop, one thread -> no restrictions
  One pn_io_t, no selector, external loop, multi threads -> ???
  multiple pn_io_t: doable, but sockets must stick to one pn_io_t


Some difficulties you might not expect: Linux doesn't care if sockets
move between selectors, or if one thread is reading on a socket while
another is writing to the same one.  Simple things like this would
have major design and performance implications for Proton on Windows.

On Wed, Feb 25, 2015 at 11:57 AM, Rafael Schloming <rh...@alum.mit.edu> wrote:
> Maybe my head is just thick today, but even staring at the docs a couple
> times and reading through what you have below, I can't say I quite
> understand what you're going for. What are the actual constraints for the
> windows APIs and what is the heavyweight stuff pn_io_t is doing?
>
> --Rafael
>
> On Wed, Feb 25, 2015 at 1:02 PM, Cliff Jansen <cl...@gmail.com> wrote:
>
>> A pn_io_t is heavyweight in Windows, because it has an opposite usage
>> pattern and moves a lot of kernel stuff into user space compared to
>> POSIX.
>>
>> The quoted documentation was my attempt to capture the Dispatch usage
>> pattern, which I assumed would be typical of an application trying to
>> spread proton engine use between threads: basically single access to
>> pn_selector_select() via a condition variable, and no more than one
>> thread working on a given selectable (using proton engine
>> encoding/decoding etc., not just io).
>>
>> In the end, we could just add a zillion locks into the Windows code
>> and make it look like it is as thread safe as the POSIX counterpart
>> (which has implicit safety when it does in the kernel what Windows is
>> doing in user space), but that would defeat using IO completion ports
>> at all.  The documentation was my attempt at balancing performance
>> with sophisticated proton usage on multiple platforms.
>>
>> Note that there is only one pn_selector_t allowed per pn_io_t (a very
>> strong Windows completion port requirement, and sockets are bound to a
>> single completion port for life).
>>
>> On Wed, Feb 25, 2015 at 8:52 AM, Rafael Schloming <rh...@alum.mit.edu>
>> wrote:
>> > On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:
>> >
>> >> Would it be safe to assume that any operations on driver->io are not
>> >> thread safe?
>> >>
>> >> Dispatch is a multi-threaded application.  It looks to me as though
>> >> io->error is a resource shared across the threads in an unsafe way.
>> >>
>> >
>> > Interesting... so this is what the docs say:
>> >
>> > /**
>> >  * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
>> >  * pn_io_t object may have zero or one pn_selector_t selectors
>> >  * associated with it (see ::pn_io_selector()).  If one is associated,
>> >  * all the pn_socket_t handles managed by a pn_io_t must use that
>> >  * pn_selector_t instance.
>> >  *
>> >  * The pn_io_t interface is single-threaded. All methods are intended
>> >  * to be used by one thread at a time, except that multiple threads
>> >  * may use:
>> >  *
>> >  *   ::pn_write()
>> >  *   ::pn_send()
>> >  *   ::pn_recv()
>> >  *   ::pn_close()
>> >  *   ::pn_selector_select()
>> >  *
>> >  * provided at most one thread is calling ::pn_selector_select() and
>> >  * the other threads are operating on separate pn_socket_t handles.
>> >  */
>> >
>> > I think this has been somewhat modified by the constraints from the
>> windows
>> > implementation, and I'm not sure I understand completely what the
>> > constraints are there, or entirely what is being described above, but on
>> > the posix front, the pn_io_t is little more than just a holder for an
>> error
>> > slot, and you should have one of these per thread. It shouldn't be a
>> > problem to use send/recv/etc from multiple threads though so long as you
>> > pass in the pn_io_t from the current thread.
>> >
>> > --Rafael
>>

Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
Maybe my head is just thick today, but even staring at the docs a couple
times and reading through what you have below, I can't say I quite
understand what you're going for. What are the actual constraints for the
windows APIs and what is the heavyweight stuff pn_io_t is doing?

--Rafael

On Wed, Feb 25, 2015 at 1:02 PM, Cliff Jansen <cl...@gmail.com> wrote:

> A pn_io_t is heavyweight in Windows, because it has an opposite usage
> pattern and moves a lot of kernel stuff into user space compared to
> POSIX.
>
> The quoted documentation was my attempt to capture the Dispatch usage
> pattern, which I assumed would be typical of an application trying to
> spread proton engine use between threads: basically single access to
> pn_selector_select() via a condition variable, and no more than one
> thread working on a given selectable (using proton engine
> encoding/decoding etc., not just io).
>
> In the end, we could just add a zillion locks into the Windows code
> and make it look like it is as thread safe as the POSIX counterpart
> (which has implicit safety when it does in the kernel what Windows is
> doing in user space), but that would defeat using IO completion ports
> at all.  The documentation was my attempt at balancing performance
> with sophisticated proton usage on multiple platforms.
>
> Note that there is only one pn_selector_t allowed per pn_io_t (a very
> strong Windows completion port requirement, and sockets are bound to a
> single completion port for life).
>
> On Wed, Feb 25, 2015 at 8:52 AM, Rafael Schloming <rh...@alum.mit.edu>
> wrote:
> > On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:
> >
> >> Would it be safe to assume that any operations on driver->io are not
> >> thread safe?
> >>
> >> Dispatch is a multi-threaded application.  It looks to me as though
> >> io->error is a resource shared across the threads in an unsafe way.
> >>
> >
> > Interesting... so this is what the docs say:
> >
> > /**
> >  * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
> >  * pn_io_t object may have zero or one pn_selector_t selectors
> >  * associated with it (see ::pn_io_selector()).  If one is associated,
> >  * all the pn_socket_t handles managed by a pn_io_t must use that
> >  * pn_selector_t instance.
> >  *
> >  * The pn_io_t interface is single-threaded. All methods are intended
> >  * to be used by one thread at a time, except that multiple threads
> >  * may use:
> >  *
> >  *   ::pn_write()
> >  *   ::pn_send()
> >  *   ::pn_recv()
> >  *   ::pn_close()
> >  *   ::pn_selector_select()
> >  *
> >  * provided at most one thread is calling ::pn_selector_select() and
> >  * the other threads are operating on separate pn_socket_t handles.
> >  */
> >
> > I think this has been somewhat modified by the constraints from the
> windows
> > implementation, and I'm not sure I understand completely what the
> > constraints are there, or entirely what is being described above, but on
> > the posix front, the pn_io_t is little more than just a holder for an
> error
> > slot, and you should have one of these per thread. It shouldn't be a
> > problem to use send/recv/etc from multiple threads though so long as you
> > pass in the pn_io_t from the current thread.
> >
> > --Rafael
>

Re: I think that's a blocker...

Posted by Cliff Jansen <cl...@gmail.com>.
A pn_io_t is heavyweight in Windows, because it has an opposite usage
pattern and moves a lot of kernel stuff into user space compared to
POSIX.

The quoted documentation was my attempt to capture the Dispatch usage
pattern, which I assumed would be typical of an application trying to
spread proton engine use between threads: basically single access to
pn_selector_select() via a condition variable, and no more than one
thread working on a given selectable (using proton engine
encoding/decoding etc., not just io).

In the end, we could just add a zillion locks into the Windows code
and make it look like it is as thread safe as the POSIX counterpart
(which has implicit safety when it does in the kernel what Windows is
doing in user space), but that would defeat using IO completion ports
at all.  The documentation was my attempt at balancing performance
with sophisticated proton usage on multiple platforms.

Note that there is only one pn_selector_t allowed per pn_io_t (a very
strong Windows completion port requirement, and sockets are bound to a
single completion port for life).

On Wed, Feb 25, 2015 at 8:52 AM, Rafael Schloming <rh...@alum.mit.edu> wrote:
> On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:
>
>> Would it be safe to assume that any operations on driver->io are not
>> thread safe?
>>
>> Dispatch is a multi-threaded application.  It looks to me as though
>> io->error is a resource shared across the threads in an unsafe way.
>>
>
> Interesting... so this is what the docs say:
>
> /**
>  * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
>  * pn_io_t object may have zero or one pn_selector_t selectors
>  * associated with it (see ::pn_io_selector()).  If one is associated,
>  * all the pn_socket_t handles managed by a pn_io_t must use that
>  * pn_selector_t instance.
>  *
>  * The pn_io_t interface is single-threaded. All methods are intended
>  * to be used by one thread at a time, except that multiple threads
>  * may use:
>  *
>  *   ::pn_write()
>  *   ::pn_send()
>  *   ::pn_recv()
>  *   ::pn_close()
>  *   ::pn_selector_select()
>  *
>  * provided at most one thread is calling ::pn_selector_select() and
>  * the other threads are operating on separate pn_socket_t handles.
>  */
>
> I think this has been somewhat modified by the constraints from the windows
> implementation, and I'm not sure I understand completely what the
> constraints are there, or entirely what is being described above, but on
> the posix front, the pn_io_t is little more than just a holder for an error
> slot, and you should have one of these per thread. It shouldn't be a
> problem to use send/recv/etc from multiple threads though so long as you
> pass in the pn_io_t from the current thread.
>
> --Rafael

Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Wed, Feb 25, 2015 at 12:48 PM, Ted Ross <tr...@redhat.com> wrote:

>
>
> On 02/25/2015 11:52 AM, Rafael Schloming wrote:
>
>> On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:
>>
>>  Would it be safe to assume that any operations on driver->io are not
>>> thread safe?
>>>
>>> Dispatch is a multi-threaded application.  It looks to me as though
>>> io->error is a resource shared across the threads in an unsafe way.
>>>
>>>
>> Interesting... so this is what the docs say:
>>
>> /**
>>   * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
>>   * pn_io_t object may have zero or one pn_selector_t selectors
>>   * associated with it (see ::pn_io_selector()).  If one is associated,
>>   * all the pn_socket_t handles managed by a pn_io_t must use that
>>   * pn_selector_t instance.
>>   *
>>   * The pn_io_t interface is single-threaded. All methods are intended
>>   * to be used by one thread at a time, except that multiple threads
>>   * may use:
>>   *
>>   *   ::pn_write()
>>   *   ::pn_send()
>>   *   ::pn_recv()
>>   *   ::pn_close()
>>   *   ::pn_selector_select()
>>   *
>>   * provided at most one thread is calling ::pn_selector_select() and
>>   * the other threads are operating on separate pn_socket_t handles.
>>   */
>>
>
> I claim that the commit-in-question violates the text above.  Calls to
> pn_send() and pn_recv() are no longer thread-safe because they now use the
> shared error record.
>

You could be right. I'm not entirely sure how to interpret the above text.
I don't know that I would necessarily consider pn_send/pn_recv to be
"methods" of pn_io_t.


>
>> I think this has been somewhat modified by the constraints from the
>> windows
>> implementation, and I'm not sure I understand completely what the
>> constraints are there, or entirely what is being described above, but on
>> the posix front, the pn_io_t is little more than just a holder for an
>> error
>> slot, and you should have one of these per thread. It shouldn't be a
>> problem to use send/recv/etc from multiple threads though so long as you
>> pass in the pn_io_t from the current thread.
>>
>
> It's not desirable to allocate sockets to threads up front (i.e. partition
> the set of sockets into per-thread slots).  I know you didn't say that was
> needed but it's what I infer from the docs for pn_io_t.
>

I don't think the posix implementation requires this.


> Assuming, as you suggest, that pn_io_t is nothing more than a
> thread-specific error notepad seems like a recipe for future disaster
> because pn_io_t is clearly intended to be more than that.
>

It may not work out so well on Windows, I honestly don't know what the
situation is there, but certainly for POSIX systems I think we need
*something* in this area to function as a context that can be
associated with thread-or-smaller granularities. Having some way to
return error information is just one example of what such a context is
useful for. If the Windows I/O APIs require a heavier-weight interface,
then perhaps we need to factor it into two different parts.

--Rafael

Re: I think that's a blocker...

Posted by Ted Ross <tr...@redhat.com>.

On 02/25/2015 11:52 AM, Rafael Schloming wrote:
> On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:
>
>> Would it be safe to assume that any operations on driver->io are not
>> thread safe?
>>
>> Dispatch is a multi-threaded application.  It looks to me as though
>> io->error is a resource shared across the threads in an unsafe way.
>>
>
> Interesting... so this is what the docs say:
>
> /**
>   * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
>   * pn_io_t object may have zero or one pn_selector_t selectors
>   * associated with it (see ::pn_io_selector()).  If one is associated,
>   * all the pn_socket_t handles managed by a pn_io_t must use that
>   * pn_selector_t instance.
>   *
>   * The pn_io_t interface is single-threaded. All methods are intended
>   * to be used by one thread at a time, except that multiple threads
>   * may use:
>   *
>   *   ::pn_write()
>   *   ::pn_send()
>   *   ::pn_recv()
>   *   ::pn_close()
>   *   ::pn_selector_select()
>   *
>   * provided at most one thread is calling ::pn_selector_select() and
>   * the other threads are operating on separate pn_socket_t handles.
>   */

I claim that the commit-in-question violates the text above.  Calls to 
pn_send() and pn_recv() are no longer thread-safe because they now use 
the shared error record.

>
> I think this has been somewhat modified by the constraints from the windows
> implementation, and I'm not sure I understand completely what the
> constraints are there, or entirely what is being described above, but on
> the posix front, the pn_io_t is little more than just a holder for an error
> slot, and you should have one of these per thread. It shouldn't be a
> problem to use send/recv/etc from multiple threads though so long as you
> pass in the pn_io_t from the current thread.

It's not desirable to allocate sockets to threads up front (i.e. 
partition the set of sockets into per-thread slots).  I know you didn't 
say that was needed but it's what I infer from the docs for pn_io_t.

Assuming, as you suggest, that pn_io_t is nothing more than a 
thread-specific error notepad seems like a recipe for future disaster 
because pn_io_t is clearly intended to be more than that.

-Ted

Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Wed, Feb 25, 2015 at 10:49 AM, Ted Ross <tr...@redhat.com> wrote:

> Would it be safe to assume that any operations on driver->io are not
> thread safe?
>
> Dispatch is a multi-threaded application.  It looks to me as though
> io->error is a resource shared across the threads in an unsafe way.
>

Interesting... so this is what the docs say:

/**
 * A ::pn_io_t manages IO for a group of pn_socket_t handles.  A
 * pn_io_t object may have zero or one pn_selector_t selectors
 * associated with it (see ::pn_io_selector()).  If one is associated,
 * all the pn_socket_t handles managed by a pn_io_t must use that
 * pn_selector_t instance.
 *
 * The pn_io_t interface is single-threaded. All methods are intended
 * to be used by one thread at a time, except that multiple threads
 * may use:
 *
 *   ::pn_write()
 *   ::pn_send()
 *   ::pn_recv()
 *   ::pn_close()
 *   ::pn_selector_select()
 *
 * provided at most one thread is calling ::pn_selector_select() and
 * the other threads are operating on separate pn_socket_t handles.
 */

I think this has been somewhat modified by the constraints from the
Windows implementation, and I'm not sure I completely understand what
those constraints are, or entirely what is being described above. On
the POSIX front, though, the pn_io_t is little more than a holder for
an error slot, and you should have one of these per thread. It
shouldn't be a problem to use send/recv/etc. from multiple threads so
long as you pass in the pn_io_t from the current thread.
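
Concretely, the per-thread pattern would look something like this on
POSIX (a sketch only; pn_io()/pn_send()/pn_io_error() as in the io.h of
this era, with the exact signatures treated as approximate):

#include <stdio.h>
#include <proton/io.h>
#include <proton/error.h>

void *worker(void *arg)
{
    pn_socket_t sock = *(pn_socket_t *)arg;
    pn_io_t *io = pn_io();        /* one pn_io_t per thread */
    const char msg[] = "hello";

    if (pn_send(io, sock, msg, sizeof(msg)) < 0) {
        /* The error slot consulted here belongs to this thread's own
         * pn_io_t, so no other thread can have clobbered it. */
        fprintf(stderr, "send failed: %s\n",
                pn_error_text(pn_io_error(io)));
    }

    pn_io_free(io);
    return NULL;
}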

--Rafael

Re: I think that's a blocker...

Posted by Ted Ross <tr...@redhat.com>.
Would it be safe to assume that any operations on driver->io are not 
thread safe?

Dispatch is a multi-threaded application.  It looks to me as though 
io->error is a resource shared across the threads in an unsafe way.

-Ted

On 02/25/2015 08:55 AM, Rafael Schloming wrote:
> This isn't necessarily a proton bug. Nothing in the referenced checkin
> actually touches the logic around allocating/freeing error strings, it
> merely causes pn_send/pn_recv to make use of pn_io_t's pn_error_t where
> previously it threw away the error information. This would suggest that
> there is perhaps a pre-existing bug in dispatch where it is calling
> pn_send/pn_recv with a pn_io_t that has been freed, and it is only now
> triggering due to the additional asserts that are encountered due to not
> ignoring the error information.
>
> I could be mistaken, but I would try reproducing this under valgrind. That
> will tell you where the first free occurred and that should hopefully make
> it obvious whether this is indeed a proton bug or whether dispatch is
> somehow freeing the pn_io_t sooner than it should.
>
> (FWIW, if it is indeed a proton bug, then I would agree it is a blocker.)
>
> --Rafael
>
> On Wed, Feb 25, 2015 at 7:54 AM, Michael Goulish <mg...@redhat.com>
> wrote:
>
>> ...but if not, somebody please feel free to correct me.
>>
>> The Jira that I just created -- PROTON-826 -- is for a
>> bug I found with my topology testing of the Dispatch Router,
>> in which I repeatedly kill and restart a router and make
>> sure that the router network comes back to the same topology
>> that it had before.
>>
>> As of checkin 01cb00c -- which had no Jira -- it is pretty
>> easy for my test to blow core.  It looks like an error
>> string is being double-freed (maybe) in the proton library.
>>
>> ( full info in the Jira.  https://issues.apache.org/jira/browse/PROTON-826
>> )
>>
>>
>>
>

Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Wed, Feb 25, 2015 at 9:53 AM, Alan Conway <ac...@redhat.com> wrote:

> On Wed, 2015-02-25 at 09:04 -0500, Michael Goulish wrote:
> > Good point!  I'm afraid it will take me the rest of my life
> > to reproduce under valgrind .. but ... I'll see what I can do....
>
> Try this in your environment:
>  export MALLOC_PERTURB_=66
> That will cause malloc to immediately fill freed memory with 0x42 bytes
> so it is obvious when you gdb the core dump if someone is using freed
> memory.
>
> It's not as informative as valgrind but has no performance impact that I
> can detect, and it often helps to crash faster and closer to the real
> problem. Freed memory can hold valid-seeming values for a while so your
> code may not notice immediately, whereas 4242424242 is rarely valid for
> anything.
>
> > In the meantime -- I'm not sure what to do with a Jira if the
> > provenance is in doubt...
>
> Maybe just put a note on it till we know more.
>

+1

FWIW, given that it looks like we are gonna wait for Andrew's SASL changes
to land, and Gordon is away till the end of the week and still working on
some documentation that would be nice to get into the release, I suspect
you have at least a few days to investigate the issue before we get a
release candidate for 0.9.

--Rafael

Re: I think that's a blocker...

Posted by Alan Conway <ac...@redhat.com>.
On Wed, 2015-02-25 at 09:04 -0500, Michael Goulish wrote:
> Good point!  I'm afraid it will take me the rest of my life
> to reproduce under valgrind .. but ... I'll see what I can do....

Try this in your environment:
 export MALLOC_PERTURB_=66
That will cause glibc's allocator to immediately fill freed memory with
0x42 bytes, so it is obvious when you gdb the core dump if someone is
using freed memory.

It's not as informative as valgrind but has no performance impact that I
can detect, and it often helps to crash faster and closer to the real
problem. Freed memory can hold valid-seeming values for a while so your
code may not notice immediately, whereas 4242424242 is rarely valid for
anything.
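
For the record, a tiny (deliberately broken) demo of what the trick
shows -- build it, export MALLOC_PERTURB_=66, and the dangling read
comes back as 0x42 bytes instead of plausible-looking leftovers:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *err = malloc(64);
    strcpy(err, "amqp:connection:framing-error");
    free(err);

    /* Use-after-free on purpose.  With MALLOC_PERTURB_=66 (glibc), the
     * freed block is filled with 0x42; the allocator may reuse the
     * first few bytes for its own bookkeeping, so look past them. */
    printf("err[16..19] = %02x %02x %02x %02x\n",
           (unsigned char)err[16], (unsigned char)err[17],
           (unsigned char)err[18], (unsigned char)err[19]);
    return 0;
}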
 
> In the meantime -- I'm not sure what to do with a Jira if the
> provenance is in doubt...

Maybe just put a note on it till we know more.

> 
> 
> ----- Original Message -----
> > This isn't necessarily a proton bug. Nothing in the referenced checkin
> > actually touches the logic around allocating/freeing error strings, it
> > merely causes pn_send/pn_recv to make use of pn_io_t's pn_error_t where
> > previously it threw away the error information. This would suggest that
> > there is perhaps a pre-existing bug in dispatch where it is calling
> > pn_send/pn_recv with a pn_io_t that has been freed, and it is only now
> > triggering due to the additional asserts that are encountered due to not
> > ignoring the error information.
> > 
> > I could be mistaken, but I would try reproducing this under valgrind. That
> > will tell you where the first free occurred and that should hopefully make
> > it obvious whether this is indeed a proton bug or whether dispatch is
> > somehow freeing the pn_io_t sooner than it should.
> > 
> > (FWIW, if it is indeed a proton bug, then I would agree it is a blocker.)
> > 
> > --Rafael
> > 
> > On Wed, Feb 25, 2015 at 7:54 AM, Michael Goulish <mg...@redhat.com>
> > wrote:
> > 
> > > ...but if not, somebody please feel free to correct me.
> > >
> > > The Jira that I just created -- PROTON-826 -- is for a
> > > bug I found with my topology testing of the Dispatch Router,
> > > in which I repeatedly kill and restart a router and make
> > > sure that the router network comes back to the same topology
> > > that it had before.
> > >
> > > As of checkin 01cb00c -- which had no Jira -- it is pretty
> > > easy for my test to blow core.  It looks like an error
> > > string is being double-freed (maybe) in the proton library.
> > >
> > > ( full info in the Jira.  https://issues.apache.org/jira/browse/PROTON-826
> > > )
> > >
> > >
> > >
> > 



Re: I think that's a blocker...

Posted by Michael Goulish <mg...@redhat.com>.
Good point!  I'm afraid it will take me the rest of my life
to reproduce under valgrind .. but ... I'll see what I can do....

In the meantime -- I'm not sure what to do with a Jira if the
provenance is in doubt...


----- Original Message -----
> This isn't necessarily a proton bug. Nothing in the referenced checkin
> actually touches the logic around allocating/freeing error strings, it
> merely causes pn_send/pn_recv to make use of pn_io_t's pn_error_t where
> previously it threw away the error information. This would suggest that
> there is perhaps a pre-existing bug in dispatch where it is calling
> pn_send/pn_recv with a pn_io_t that has been freed, and it is only now
> triggering due to the additional asserts that are encountered due to not
> ignoring the error information.
> 
> I could be mistaken, but I would try reproducing this under valgrind. That
> will tell you where the first free occurred and that should hopefully make
> it obvious whether this is indeed a proton bug or whether dispatch is
> somehow freeing the pn_io_t sooner than it should.
> 
> (FWIW, if it is indeed a proton bug, then I would agree it is a blocker.)
> 
> --Rafael
> 
> On Wed, Feb 25, 2015 at 7:54 AM, Michael Goulish <mg...@redhat.com>
> wrote:
> 
> > ...but if not, somebody please feel free to correct me.
> >
> > The Jira that I just created -- PROTON-826 -- is for a
> > bug I found with my topology testing of the Dispatch Router,
> > in which I repeatedly kill and restart a router and make
> > sure that the router network comes back to the same topology
> > that it had before.
> >
> > As of checkin 01cb00c -- which had no Jira -- it is pretty
> > easy for my test to blow core.  It looks like an error
> > string is being double-freed (maybe) in the proton library.
> >
> > ( full info in the Jira.  https://issues.apache.org/jira/browse/PROTON-826
> > )
> >
> >
> >
> 

Re: I think that's a blocker...

Posted by Rafael Schloming <rh...@alum.mit.edu>.
This isn't necessarily a proton bug. Nothing in the referenced checkin
actually touches the logic around allocating/freeing error strings; it
merely causes pn_send/pn_recv to make use of pn_io_t's pn_error_t where
previously they threw away the error information. This would suggest
that there is perhaps a pre-existing bug in dispatch where it is
calling pn_send/pn_recv with a pn_io_t that has been freed, and it is
only triggering now because the error information is no longer ignored
and the additional asserts are encountered.

I could be mistaken, but I would try reproducing this under valgrind. That
will tell you where the first free occurred and that should hopefully make
it obvious whether this is indeed a proton bug or whether dispatch is
somehow freeing the pn_io_t sooner than it should.

(FWIW, if it is indeed a proton bug, then I would agree it is a blocker.)

--Rafael

On Wed, Feb 25, 2015 at 7:54 AM, Michael Goulish <mg...@redhat.com>
wrote:

> ...but if not, somebody please feel free to correct me.
>
> The Jira that I just created -- PROTON-826 -- is for a
> bug I found with my topology testing of the Dispatch Router,
> in which I repeatedly kill and restart a router and make
> sure that the router network comes back to the same topology
> that it had before.
>
> As of checkin 01cb00c -- which had no Jira -- it is pretty
> easy for my test to blow core.  It looks like an error
> string is being double-freed (maybe) in the proton library.
>
> ( full info in the Jira.  https://issues.apache.org/jira/browse/PROTON-826
> )
>
>
>