Posted to proton@qpid.apache.org by Rafael Schloming <rh...@alum.mit.edu> on 2015/03/31 18:08:41 UTC

codec changes

Hi Alan,

Sorry I didn't comment on this sooner; I didn't have time to respond to
your original review request during my travels. However, I do have some
thoughts on the changes you made to the codec interface. I noticed you
added a separate accessor for the size:

    ssize_t pn_data_encoded_size(pn_data_t *data);

This is alongside the original encode method:

    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);

I think this API choice, while nice in that it is backwards compatible, is
also going to result in code that is roughly twice as slow as it needs to
be in the most common case. Based on my experience implementing and
profiling codec in C, Python, and Java, computing the size of the encoded
data is usually roughly the same amount of work as actually encoding it,
regardless of the implementation language. Therefore code like this:

    if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

can end up being roughly twice as slow as code like this:

    ssize_t err;
    while ((err = pn_data_encode(data, buffer, buffer_size())) ==
PN_OVERFLOW) {
        grow_buffer();
    }

Admittedly the latter form is much more awkward in those cases where you
don't care about performance, so I'm all for providing something nicer, but
I think a better API change would be to steal a page from the C stdio.h
APIs and have pn_data_encode always return the number of bytes that would
have been written had there been enough space. This allows you to write the
simplified encode like so:

    if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Or use a more optimal form:

    ssize_t n = pn_data_encode(data, buffer, buffer_size());
    if (n > buffer_size()) {
        grow_buffer();
        pn_data_encode(data, buffer, buffer_size());
    }

This makes the slow/convenient form possible, and provides some options
that are a bit less awkward than the loop, but it also makes it very clear
that when you use the slow/convenient form you are incurring roughly twice
the cost of the alternative.

Normally I wouldn't be overly fussed by something like this, and I realize
what I'm suggesting is a breaking change relative to what you provided, but
based on what profiling we've done in the past, codec is probably the most
significant source of overhead that we add to an application, and exactly
this sort of double encode effect is almost always one of the first things
you hit when you try to optimize. Given this, I think it would be a good
thing if the API accurately reflects the relative cost of the different
styles of use.

Thoughts?

--Rafael

Re: codec changes

Posted by Dominic Evans <do...@uk.ibm.com>.
-----Rafael Schloming <rh...@alum.mit.edu> wrote: -----
> On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway <ac...@redhat.com>
> wrote:
>> That works for me, now how do we manage the transition? I don't think
>> we can afford to inflict "yum update proton; all proton apps crash"
>> on our users.  That means we cannot change the behavior of existing
>> function names. I don't much like adding encode_foo2 for every
>> encode_foo but I don't see what else to do.
>>
> 
> We could mark the old ones as deprecated and add the new ones as
> pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode
> to pn_xxx_encode2. Then after a sufficient transition period we can
> get rid of the old versions.
> 
> --Rafael
>

+1 

-- 

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Mon, May 11, 2015 at 3:49 PM, Alan Conway <ac...@redhat.com> wrote:

> On Thu, 2015-05-07 at 15:53 -0400, Rafael Schloming wrote:
> > I believe where we ended up was standardizing on a single snprintf-style
> > signature for all encode functions, i.e.:
> >
> >     // always returns encoded size, never overwrites limit,
> >     // pass in NULL, 0 to compute size in advance
> >     // returns < 0 if there was an actual error of some sort,
> >     // e.g. xxx_t cannot be validly encoded for some reason
> >     ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);
> >
> > And transitioning to this with a feature macro.
>
> I'm good with sprintf-style.
>
> I'm not sure what you mean by a feature macro. Does that address
> binary compatibility or is it just a source-level switch? If we break
> binary compatibility we need to make sure we bump the library version
> appropriately to avoid breaking existing apps. If we set the switch to
> default to the new behavior we should make sure the compiler barfs on
> old code and it's fairly obvious what to do about it.
>

I've never actually defined feature test macros before, just used them, so
I'm just making this up as we go along. What I imagine could work though is
to define pn_xxx_encode2 in the .c file to provide the new behavior. By
default it would not be visible from the header files unless the
appropriate feature macro was defined upon inclusion, in which case we
would alias pn_xxx_encode to pn_xxx_encode2, e.g.:

--------------
#include <codec.h>
#include <message.h>
...
// we could make this spew a deprecated warning by default
int err = pn_message_encode(msg, buf, &in_out_size);
--------------

--------------
#define PN_STANDARD_ENCODE_SIGNATURE
#include <codec.h>
#include <message.h>
...
ssize_t encoded_size = pn_message_encode(msg, buf, bufsize);
--------------

This would allow people to incrementally upgrade and would not break binary
compatibility. We could of course at some later point break binary
compatibility if we want to and clean up the encode2 symbols, especially if
we have other reasons to change ABI.

One thing I'm not sure of is the granularity of the feature test, e.g.
should the macro be specific to the change (PN_STANDARD_ENCODE_SIGNATURE)
or should it be related to the version somehow, e.g. (PN_0_10_SOURCE) or
something like that. I guess both could be an option too, e.g.
PN_0_10_SOURCE could be an alias for all the new features in 0.10,
presumably just this and possibly the sasl stuff if there is a way to make
that work.

--Rafael

Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Thu, 2015-05-07 at 15:53 -0400, Rafael Schloming wrote:
> I believe where we ended up was standardizing on a single snprintf-style
> signature for all encode functions, i.e.:
> 
>     // always returns encoded size, never overwrites limit,
>     // pass in NULL, 0 to compute size in advance
>     // returns < 0 if there was an actual error of some sort,
>     // e.g. xxx_t cannot be validly encoded for some reason
>     ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);
> 
> And transitioning to this with a feature macro.

I'm good with sprintf-style.

I'm not sure what you mean by a feature macro. Does that address
binary compatibility or is it just a source-level switch? If we break
binary compatibility we need to make sure we bump the library version
appropriately to avoid breaking existing apps. If we set the switch to
default to the new behavior we should make sure the compiler barfs on
old code and it's fairly obvious what to do about it.

Cheers,
Alan.

> 
> --Rafael
> 
> On Thu, May 7, 2015 at 3:28 PM, Andrew Stitcher <as...@redhat.com>
> wrote:
> 
> > On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote:
> > > We seem to have reached consensus here, but I haven't seen any commits on
> > > this. We should probably fix this before 0.10 so we don't end up putting
> > > out a new API and then deprecating it in the next release. Is anyone
> > > actually working on this?
> >
> > Could you articulate the consensus then.
> >
> > Asserting "we have reached consensus" without explicitly stating what
> > you think the consensus to be is remarkably likely to be proven wrong by
> > subsequent events!
> >
> > Andrew
> >
> >
> >



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
I believe where we ended up was standardizing on a single snprintf-style
signature for all encode functions, i.e.:

    // always returns encoded size, never overwrites limit,
    // pass in NULL, 0 to compute size in advance
    // returns < 0 if there was an actual error of some sort,
    // e.g. xxx_t cannot be validly encoded for some reason
    ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);

And transitioning to this with a feature macro.

--Rafael

On Thu, May 7, 2015 at 3:28 PM, Andrew Stitcher <as...@redhat.com>
wrote:

> On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote:
> > We seem to have reached consensus here, but I haven't seen any commits on
> > this. We should probably fix this before 0.10 so we don't end up putting
> > out a new API and then deprecating it in the next release. Is anyone
> > actually working on this?
>
> Could you articulate the consensus then.
>
> Asserting "we have reached consensus" without explicitly stating what
> you think the consensus to be is remarkably likely to be proven wrong by
> subsequent events!
>
> Andrew
>
>
>

Re: codec changes

Posted by Andrew Stitcher <as...@redhat.com>.
On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote:
> We seem to have reached consensus here, but I haven't seen any commits on
> this. We should probably fix this before 0.10 so we don't end up putting
> out a new API and then deprecating it in the next release. Is anyone
> actually working on this?

Could you articulate the consensus then.

Asserting "we have reached consensus" without explicitly stating what
you think the consensus to be is remarkably likely to be proven wrong by
subsequent events!

Andrew



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
We seem to have reached consensus here, but I haven't seen any commits on
this. We should probably fix this before 0.10 so we don't end up putting
out a new API and then deprecating it in the next release. Is anyone
actually working on this?

--Rafael

On Wed, Apr 15, 2015 at 2:29 PM, Andrew Stitcher <as...@redhat.com>
wrote:

> On Wed, 2015-04-15 at 07:13 -0400, Rafael Schloming wrote:
> > On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway <ac...@redhat.com> wrote:
> >
> > > That works for me, now how do we manage the transition? I don't think
> we
> > > can afford to inflict "yum update proton; all proton apps crash" on our
> > > users. That means we cannot change the behavior of existing function
> > > names. I don't much like adding encode_foo2 for every encode_foo but I
> > > don't see what else to do.
> > >
> >
> > We could mark the old ones as deprecated and add the new ones as
> > pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to
> > pn_xxx_encode2. Then after a sufficient transition period we can get rid
> of
> > the old versions.
>
> Along these lines it is also possible to keep ABI (binary library
> compatibility) by using versioned library symbols as well. There is a
> level of linker magic needed, but at least on Unix it's well understood,
> if a little arcane.
>
> Another perfectly reasonable approach would be to create a new name for
> the new API and deprecate the old name.
>
> So for example deprecate pn_data_t in favour of pn_value_t (or whatever
> better but new name). Where pn_value_t has all the old functionality of
> pn_data_t and indeed may be the same internal structure initially, but
> with a different interface.
>
> Incidentally C++ does make this easier because it allows function
> overloading.
>
> Andrew
>
>
>

Re: codec changes

Posted by Andrew Stitcher <as...@redhat.com>.
On Wed, 2015-04-15 at 07:13 -0400, Rafael Schloming wrote:
> On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway <ac...@redhat.com> wrote:
> 
> > That works for me, now how do we manage the transition? I don't think we
> > can afford to inflict "yum update proton; all proton apps crash" on our
> > users. That means we cannot change the behavior of existing function
> > names. I don't much like adding encode_foo2 for every encode_foo but I
> > don't see what else to do.
> >
> 
> We could mark the old ones as deprecated and add the new ones as
> pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to
> pn_xxx_encode2. Then after a sufficient transition period we can get rid of
> the old versions.

Along these lines it is also possible to keep ABI (binary library
compatibility) by using versioned library symbols as well. There is a
level of linker magic needed, but at least on Unix it's well understood,
if a little arcane.
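
Concretely, the "linker magic" usually means an ELF version script plus
GCC .symver aliases. A purely illustrative sketch -- the PROTON_0.9 and
PROTON_0.10 version tags and the _old/_new names are made up, not real
Proton ABI:

    # proton.map -- hypothetical GNU ld version script
    PROTON_0.9  { global: pn_data_encode; local: *; };
    PROTON_0.10 { global: pn_data_encode; } PROTON_0.9;

and in the library source, bind the two implementations to versioned
aliases of the same exported name:

    __asm__(".symver pn_data_encode_old, pn_data_encode@PROTON_0.9");
    __asm__(".symver pn_data_encode_new, pn_data_encode@@PROTON_0.10");

Existing binaries keep resolving the 0.9 symbol; newly compiled code
picks up the default (@@) 0.10 version.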

Another perfectly reasonable approach would be to create a new name for
the new API and deprecate the old name.

So for example deprecate pn_data_t in favour of pn_value_t (or whatever
better new name), where pn_value_t has all the old functionality of
pn_data_t and may indeed be the same internal structure initially, but
with a different interface.

Incidentally C++ does make this easier because it allows function
overloading.

Andrew



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway <ac...@redhat.com> wrote:

> That works for me, now how do we manage the transition? I don't think we
> can afford to inflict "yum update proton; all proton apps crash" on our
> users. That means we cannot change the behavior of existing function
> names. I don't much like adding encode_foo2 for every encode_foo but I
> don't see what else to do.
>

We could mark the old ones as deprecated and add the new ones as
pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to
pn_xxx_encode2. Then after a sufficient transition period we can get rid of
the old versions.

--Rafael

Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Sun, 2015-04-12 at 13:12 -0400, Rafael Schloming wrote:
> On Fri, Apr 10, 2015 at 11:37 AM, Alan Conway <ac...@redhat.com> wrote:
> 
> >
> > On Wed, 2015-04-08 at 14:50 -0400, Andrew Stitcher wrote:
> > > On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote:
> > > > ...
> > > > The issue is that we need buffer growth AND control over allocation.
> > > > pn_buffer_t forces use of malloc/free/realloc. That won't help if
> > you're
> > > > trying to get the data into a buffer allocated by Go for example. I
> > > > agree a growable buffer type is a nicer API than raw function pointers,
> > > > but the buffer type itself would need to use function pointers so we
> > can
> > > > replace the functions used for alloc/free/realloc.
> > >
> > > I think this is a pretty general issue actually that also crops up in
> > > embedded systems, where you want some different control over memory
> > > allocation.
> > >
> > > I think it might be better addressed by making malloc/free/realloc
> > > replaceable for the whole of proton - would that solve your go issue?
> >
> > Sadly no. A proton realloc surrogate is just going to be passed void*
> > and return void*. There's no memory safe way to implement that in Go,
> > unless someone keeps a Go pointer to the returned buffer it gets garbage
> > collected. There's no way to communicate a Go pointer thru a C proton
> > interface.
> >
> > Given all that, I think I prefer the sprintf-style solution for growing
> > buffers. It keeps the memory management strictly on the caller side of
> > the pn_ interface which is simpler.
> >
> > If there is a big pn_data overhaul in the works then maybe we should not
> > be talking about pn_data_encode and instead make sure we fix it in the
> > overhaul.
> >
> > I notice we have these other APIs:
> >
> > codec.h:499:PN_EXTERN int pn_data_format(pn_data_t *data, char *bytes,
> > size_t *size);
> > message.h:739:PN_EXTERN int pn_message_encode(pn_message_t *msg, char
> > *bytes, size_t *size);
> > ssl.h:320:PN_EXTERN int pn_ssl_get_peer_hostname( pn_ssl_t *ssl, char
> > *hostname, size_t *bufsize );
> >
> > That is a better signature IMO, the new (backward compatible) syntax for
> > such functions would be:
> >
> > return 0: *size is updated to be the actual size.
> > return PN_OVERFLOW: *size is updated to be the required size.
> > return < 0: *size is undefined.
> > don't return > 0
> >
> > Perhaps for the overhaul we should ensure all such functions (including
> > the replacement for pn_data_encode) follow this pattern?
> >
> 
> I'm a big +1 on standardizing on a single snprintf style signature for all
> encode functions.
> 
> The signatures you mention above (similar to sprintf but using an IN/OUT
> parameter) were my first attempt at a standard encode/decode signature. I
> agree with you that at first it looks a bit more appealing for some reason,
> however after using it for a while I actually found it quite cumbersome
> relative to the traditional snprintf style. In the common case it takes
> almost twice as much code to use it since you always have to declare,
> initialize, and check an extra variable in separate steps due to the IN/OUT
> parameter. It also doesn't map that nicely when you swig it since most
> languages don't do the IN/OUT parameter thing well, so you have to have
> > more cumbersome/custom typemaps and/or wrapper code to deal with the
> impedance mismatch.
> 
> For those reasons I propose that we standardize on the traditional snprintf
> rather than using the IN/OUT variant.

That works for me, now how do we manage the transition? I don't think we
can afford to inflict "yum update proton; all proton apps crash" on our
users. That means we cannot change the behavior of existing function
names. I don't much like adding encode_foo2 for every encode_foo but I
don't see what else to do.



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Fri, Apr 10, 2015 at 11:37 AM, Alan Conway <ac...@redhat.com> wrote:

>
> On Wed, 2015-04-08 at 14:50 -0400, Andrew Stitcher wrote:
> > On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote:
> > > ...
> > > The issue is that we need buffer growth AND control over allocation.
> > > pn_buffer_t forces use of malloc/free/realloc. That won't help if
> you're
> > > trying to get the data into a buffer allocated by Go for example. I
> > > agree a growable buffer type is a nicer API than raw function pointers,
> > > but the buffer type itself would need to use function pointers so we
> can
> > > replace the functions used for alloc/free/realloc.
> >
> > I think this is a pretty general issue actually that also crops up in
> > embedded systems, where you want some different control over memory
> > allocation.
> >
> > I think it might be better addressed by making malloc/free/realloc
> > replaceable for the whole of proton - would that solve your go issue?
>
> Sadly no. A proton realloc surrogate is just going to be passed void*
> and return void*. There's no memory safe way to implement that in Go,
> unless someone keeps a Go pointer to the returned buffer it gets garbage
> collected. There's no way to communicate a Go pointer thru a C proton
> interface.
>
> Given all that, I think I prefer the sprintf-style solution for growing
> buffers. It keeps the memory management strictly on the caller side of
> the pn_ interface which is simpler.
>
> If there is a big pn_data overhaul in the works then maybe we should not
> be talking about pn_data_encode and instead make sure we fix it in the
> overhaul.
>
> I notice we have these other APIs:
>
> codec.h:499:PN_EXTERN int pn_data_format(pn_data_t *data, char *bytes,
> size_t *size);
> message.h:739:PN_EXTERN int pn_message_encode(pn_message_t *msg, char
> *bytes, size_t *size);
> ssl.h:320:PN_EXTERN int pn_ssl_get_peer_hostname( pn_ssl_t *ssl, char
> *hostname, size_t *bufsize );
>
> That is a better signature IMO, the new (backward compatible) syntax for
> such functions would be:
>
> return 0: *size is updated to be the actual size.
> return PN_OVERFLOW: *size is updated to be the required size.
> return < 0: *size is undefined.
> don't return > 0
>
> Perhaps for the overhaul we should ensure all such functions (including
> the replacement for pn_data_encode) follow this pattern?
>

I'm a big +1 on standardizing on a single snprintf style signature for all
encode functions.

The signatures you mention above (similar to sprintf but using an IN/OUT
parameter) were my first attempt at a standard encode/decode signature. I
agree with you that at first it looks a bit more appealing for some reason,
however after using it for a while I actually found it quite cumbersome
relative to the traditional snprintf style. In the common case it takes
almost twice as much code to use it since you always have to declare,
initialize, and check an extra variable in separate steps due to the IN/OUT
parameter. It also doesn't map that nicely when you swig it since most
languages don't do the IN/OUT parameter thing well, so you have to have
more cumbersome/custom typemaps and/or wrapper code to deal with the
impedance mismatch.

For those reasons I propose that we standardize on the traditional snprintf
rather than using the IN/OUT variant.

--Rafael

Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Wed, 2015-04-08 at 14:50 -0400, Andrew Stitcher wrote:
> On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote:
> > ...
> > The issue is that we need buffer growth AND control over allocation.
> > pn_buffer_t forces use of malloc/free/realloc. That won't help if you're
> > trying to get the data into a buffer allocated by Go for example. I
> > agree a growable buffer type is a nicer API than raw function pointers,
> > but the buffer type itself would need to use function pointers so we can
> > replace the functions used for alloc/free/realloc.
> 
> I think this is a pretty general issue actually that also crops up in
> embedded systems, where you want some different control over memory
> allocation.
> 
> I think it might be better addressed by making malloc/free/realloc
> replaceable for the whole of proton - would that solve your go issue?

Sadly no. A proton realloc surrogate is just going to be passed void*
and return void*. There's no memory-safe way to implement that in Go:
unless someone keeps a Go pointer to the returned buffer, it gets
garbage collected. There's no way to communicate a Go pointer through a
C proton interface.

Given all that, I think I prefer the sprintf-style solution for growing
buffers. It keeps the memory management strictly on the caller side of
the pn_ interface which is simpler.

If there is a big pn_data overhaul in the works then maybe we should not
be talking about pn_data_encode and instead make sure we fix it in the
overhaul.

I notice we have these other APIs:

codec.h:499:PN_EXTERN int pn_data_format(pn_data_t *data, char *bytes, size_t *size);
message.h:739:PN_EXTERN int pn_message_encode(pn_message_t *msg, char *bytes, size_t *size);
ssl.h:320:PN_EXTERN int pn_ssl_get_peer_hostname( pn_ssl_t *ssl, char *hostname, size_t *bufsize );

That is a better signature IMO, the new (backward compatible) syntax for such functions would be:

return 0: *size is updated to be the actual size. 
return PN_OVERFLOW: *size is updated to be the required size.
return < 0: *size is undefined.
don't return > 0

Perhaps for the overhaul we should ensure all such functions (including
the replacement for pn_data_encode) follow this pattern?


Cheers,
Alan.


Re: codec changes

Posted by Andrew Stitcher <as...@redhat.com>.
On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote:
> ...
> The issue is that we need buffer growth AND control over allocation.
> pn_buffer_t forces use of malloc/free/realloc. That won't help if you're
> trying to get the data into a buffer allocated by Go for example. I
> agree a growable buffer type is a nicer API than raw function pointers,
> but the buffer type itself would need to use function pointers so we can
> replace the functions used for alloc/free/realloc.

I think this is a pretty general issue actually that also crops up in
embedded systems, where you want some different control over memory
allocation.

I think it might be better addressed by making malloc/free/realloc
replaceable for the whole of proton - would that solve your go issue?

Andrew



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
On Wed, Apr 8, 2015 at 10:48 AM, Alan Conway <ac...@redhat.com> wrote:

> On Tue, 2015-04-07 at 17:57 -0400, Rafael Schloming wrote:
> > Maybe I'm not following something, but I don't see how passing around
> > allocation functions actually solves any ownership problems. I would
> think
> > the ownership problems come from pn_data_t holding onto a pointer
> > regardless of whether that pointer was gotten from malloc or from a
> > callback. I'm assuming you want to be able to do something like use a
> > pn_data_t that is owned by one thread to encode into a buffer and pass
> that
> > buffer over to another thread. How does passing in a bunch of allocators
> > help with this? If the pn_data_t holds a pointer to whatever those
> > allocators return, then aren't you going to have ownership issues no
> matter
> > what?
> >
> > To answer Bozzo's original question, I think that it's good to keep the
> > encoder/decoder decoupled from the buffer for a number of reasons. In
> > addition to the ownership issues that Alan points out, the encoded data
> may
> > have a different lifecycle from the pn_data_t that created it for non
> > thread related reasons, or you may simply want to encode directly into a
> > frame buffer and avoid an extra copy.
> >
> > If the goal here is simply to provide a convenient way to avoid having to
> > repeat the resizing loop then I would suggest simply providing a
> > convenience API that accepts a growable buffer of some sort. This
> provides
> > both convenience and avoids ownership issues with holding pointers. I'm
> > thinking something along the lines of:
> >
> >     pn_data_t data = ...;
> >     pn_string/buffer_t buf = ...;
> >     pn_data_encode_conveniently(data, buf); // obviously needs a better
> name
> >
> > It is perhaps not *quite* as convenient in the minimal case as pn_data_t
> > holding the buffer internally, but it is an improvement in general and
> > probably simpler than having to mess with function pointers in the case
> > where the buffer's lifecycle is independent from the pn_data_t. (Not
> that I
> > really understand how that would work anyways.)
>
> The issue is that we need buffer growth AND control over allocation.
> pn_buffer_t forces use of malloc/free/realloc. That won't help if you're
> trying to get the data into a buffer allocated by Go for example. I
> agree a growable buffer type is a nicer API than raw function pointers,
> but the buffer type itself would need to use function pointers so we can
> replace the functions used for alloc/free/realloc.
>

I'm skeptical that passing around a set of function pointers, whether
directly to pn_data_t or to pn_buffer_t, is actually fewer lines of code
than simply writing the analogous convenience encoding function for that
language. For Python it would likely be fewer lines of code to simply
define something like this:

    ssize_t pn_data_encode2bytearray(pn_data_t *data, PyObject *bytearray) {
        ssize_t size = pn_data_encode(data, PyByteArray_AsString(bytearray),
                                      PyByteArray_Size(bytearray));
        if (size < 0) { return size; }
        if (size > PyByteArray_Size(bytearray)) {
            PyByteArray_Resize(bytearray, size);
            return pn_data_encode(data, PyByteArray_AsString(bytearray),
                                  PyByteArray_Size(bytearray));
        }
        return size;  // the bytearray was already large enough
    }

This would also be more flexible since you could pass in any bytearray
object from python, reuse it as much as you like, and have full control
over the lifecycle of that object, whereas I suspect given the signature of
the function pointers you are defining you would need to choose some sort
of predefined allocation strategy for creating whatever go object you are
encoding into. This sounds to me like it is both less convenient and less
flexible, but it's entirely possible I'm just not understanding how you
intend to use this. Perhaps it would clarify things if you could post an
actual example usage of how you imagine using this that hopefully
illustrates why it is fewer lines of code than the alternatives?

--Rafael

Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Tue, 2015-04-07 at 17:57 -0400, Rafael Schloming wrote:
> Maybe I'm not following something, but I don't see how passing around
> allocation functions actually solves any ownership problems. I would think
> the ownership problems come from pn_data_t holding onto a pointer
> regardless of whether that pointer was gotten from malloc or from a
> callback. I'm assuming you want to be able to do something like use a
> pn_data_t that is owned by one thread to encode into a buffer and pass that
> buffer over to another thread. How does passing in a bunch of allocators
> help with this? If the pn_data_t holds a pointer to whatever those
> allocators return, then aren't you going to have ownership issues no matter
> what?
> 
> To answer Bozzo's original question, I think that it's good to keep the
> encoder/decoder decoupled from the buffer for a number of reasons. In
> addition to the ownership issues that Alan points out, the encoded data may
> have a different lifecycle from the pn_data_t that created it for non
> thread related reasons, or you may simply want to encode directly into a
> frame buffer and avoid an extra copy.
> 
> If the goal here is simply to provide a convenient way to avoid having to
> repeat the resizing loop then I would suggest simply providing a
> convenience API that accepts a growable buffer of some sort. This provides
> both convenience and avoids ownership issues with holding pointers. I'm
> thinking something along the lines of:
> 
>     pn_data_t data = ...;
>     pn_string/buffer_t buf = ...;
>     pn_data_encode_conveniently(data, buf); // obviously needs a better name
> 
> It is perhaps not *quite* as convenient in the minimal case as pn_data_t
> holding the buffer internally, but it is an improvement in general and
> probably simpler than having to mess with function pointers in the case
> where the buffer's lifecycle is independent from the pn_data_t. (Not that I
> really understand how that would work anyways.)

The issue is that we need buffer growth AND control over allocation.
pn_buffer_t forces use of malloc/free/realloc. That won't help if you're
trying to get the data into a buffer allocated by Go for example. I
agree a growable buffer type is a nicer API than raw function pointers,
but the buffer type itself would need to use function pointers so we can
replace the functions used for alloc/free/realloc.



Re: codec changes

Posted by Rafael Schloming <rh...@alum.mit.edu>.
Maybe I'm not following something, but I don't see how passing around
allocation functions actually solves any ownership problems. I would think
the ownership problems come from pn_data_t holding onto a pointer
regardless of whether that pointer was gotten from malloc or from a
callback. I'm assuming you want to be able to do something like use a
pn_data_t that is owned by one thread to encode into a buffer and pass that
buffer over to another thread. How does passing in a bunch of allocators
help with this? If the pn_data_t holds a pointer to whatever those
allocators return, then aren't you going to have ownership issues no matter
what?

To answer Bozzo's original question, I think that it's good to keep the
encoder/decoder decoupled from the buffer for a number of reasons. In
addition to the ownership issues that Alan points out, the encoded data may
have a different lifecycle from the pn_data_t that created it for non
thread related reasons, or you may simply want to encode directly into a
frame buffer and avoid an extra copy.

If the goal here is simply to provide a convenient way to avoid having to
repeat the resizing loop then I would suggest simply providing a
convenience API that accepts a growable buffer of some sort. This provides
both convenience and avoids ownership issues with holding pointers. I'm
thinking something along the lines of:

    pn_data_t data = ...;
    pn_string/buffer_t buf = ...;
    pn_data_encode_conveniently(data, buf); // obviously needs a better name

It is perhaps not *quite* as convenient in the minimal case as pn_data_t
holding the buffer internally, but it is an improvement in general and
probably simpler than having to mess with function pointers in the case
where the buffer's lifecycle is independent from the pn_data_t. (Not that I
really understand how that would work anyways.)
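
For illustration, such a convenience wrapper could just hide the
PN_OVERFLOW/grow loop behind a growable buffer. This sketch uses a stand-in
encoder (fake_encode) in place of pn_data_encode, and all the names here are
invented, not proton API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define PN_OVERFLOW (-2)  /* mirrors proton's overflow error code */

/* Stand-in for pn_data_encode: pretends the encoding needs 100 bytes. */
static ssize_t fake_encode(char *bytes, size_t size) {
    if (size < 100) return PN_OVERFLOW;
    memset(bytes, 'x', 100);
    return 100;
}

typedef struct { char *bytes; size_t capacity; } grow_buf;

/* The convenience wrapper: retry the encode, doubling the buffer on
   overflow, so callers never see the resizing loop. */
static ssize_t encode_conveniently(grow_buf *buf) {
    for (;;) {
        ssize_t n = fake_encode(buf->bytes, buf->capacity);
        if (n != PN_OVERFLOW) return n;
        size_t cap = buf->capacity ? buf->capacity * 2 : 64;
        char *p = realloc(buf->bytes, cap);
        if (!p) return -1;
        buf->bytes = p;
        buf->capacity = cap;
    }
}
```

The buffer argument keeps the ownership question with the caller: the
pn_data_t never holds a pointer into it.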

--Rafael


On Tue, Apr 7, 2015 at 1:21 PM, Alan Conway <ac...@redhat.com> wrote:

> On Tue, 2015-04-07 at 11:00 -0400, Andrew Stitcher wrote:
> > On Tue, 2015-04-07 at 09:38 -0400, Alan Conway wrote:
> > > On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote:
> > > > Given the memory overhead of a pn_data_t before encoding, why not
> have it
> > > > own an encode buffer? it could get by with exactly that grow_buffer()
> > > > callback if ownership is the issue .
> > >
> >
> > I think the best way to do this would be to introduce a new class to sit
> > on top of the existing pn_data_t which does this, rather than extending
> > the current pn_data_t.
> >
> > So I think the below is fine, but I'd prefer to avoid stuffing it all
> > into pn_data_t especially as I think the class is, long term, going
> > away.
>
> Who's replacing it ;) ? Are they taking notes?
>
> Cheers,
> Alan.
>
>

Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Tue, 2015-04-07 at 11:00 -0400, Andrew Stitcher wrote:
> On Tue, 2015-04-07 at 09:38 -0400, Alan Conway wrote:
> > On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote:
> > > Given the memory overhead of a pn_data_t before encoding, why not have it
> > > own an encode buffer? it could get by with exactly that grow_buffer()
> > > callback if ownership is the issue .
> > 
> 
> I think the best way to do this would be to introduce a new class to sit
> on top of the existing pn_data_t which does this, rather than extending
> the current pn_data_t.
> 
> So I think the below is fine, but I'd prefer to avoid stuffing it all
> into pn_data_t especially as I think the class is, long term, going
> away.

Who's replacing it ;) ? Are they taking notes?

Cheers,
Alan.


Re: codec changes

Posted by Andrew Stitcher <as...@redhat.com>.
On Tue, 2015-04-07 at 09:38 -0400, Alan Conway wrote:
> On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote:
> > Given the memory overhead of a pn_data_t before encoding, why not have it
> > own an encode buffer? it could get by with exactly that grow_buffer()
> > callback if ownership is the issue .
> 

I think the best way to do this would be to introduce a new class to sit
on top of the existing pn_data_t which does this, rather than extending
the current pn_data_t.

So I think the below is fine, but I'd prefer to avoid stuffing it all
into pn_data_t especially as I think the class is, long term, going
away.

Andrew

> I like this idea a lot. Ownership is an issue for me in Go so I would
> want an alloc/free callback but we could have a non-callback version as
> well (simple wrapper for the callback version with default callbacks)
> for users that don't care about ownership - what do others think?
> Something like:
> 
> // If allocfn or freefn is NULL, or allocfn returns NULL, return PN_EOS when out of space.
> ssize_t pn_data_encode_alloc(pn_data_t *data, char *bytes, size_t size,
>   void*(*allocfn)(size_t), void (*freefn)(void*));
> 
> // Compatibility
> ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size) {
>   return pn_data_encode_alloc(data, bytes, size, NULL, NULL);
> }
> 
> // Convenience when the caller would like encode to do buffer management for them.
> // If *bytes is not NULL it is the initial buffer; if NULL, encode allocates.
> // *bytes is updated to point to the final buffer, which must be freed with free().
> ssize_t pn_data_encode_grow(pn_data_t *data, char **bytes, size_t size) {
>   return pn_data_encode_alloc(data, *bytes, size, pni_realloc, free);
> }
> 
> Note:
> - alloc called 0 or 1 times, and is passed the exact size needed to encode.
> - user is free to allocate more (e.g. doubling strategies)
> - free called after alloc, if and only if alloc is called.
> - separate free instead of single realloc call: encode impl decides
> if/how much data to copy from old to new buffer.
> 
> This can probably be made a bit simpler.
> 
> > Bozzo
> > On Mar 31, 2015 6:10 PM, "Rafael Schloming" <rh...@alum.mit.edu> wrote:
> > 
> > > Hi Alan,
> > >
> > > Sorry I didn't comment on this sooner, I didn't have time to comment on
> > > your original review request during my travels, however I do have some
> > > thoughts on the changes you made to the codec interface. I noticed you
> > > added a separate accessor for the size:
> > >
> > >     ssize_t pn_data_encoded_size(pn_data_t *data);
> > >
> > > This is alongside the original encode method:
> > >
> > >     ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);
> > >
> > > I think this API choice while nice in that it is backwards compatible is
> > > also going to result in code that is roughly twice as slow as it needs to
> > > be in the most common case. Based on my experience implementing and
> > > profiling codec in C, Python, and Java, computing the size of the encoded
> > > data seems to usually be roughly the same amount of work as actually
> > > encoding it regardless of the implementation language. Therefore code like
> > > this:
> > >
> > >     if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
> > >     pn_data_encode(data, buffer, buffer_size());
> > >
> > > Can end up being roughly twice as slow as code like this:
> > >
> > >     ssize_t err;
> > >     while ((err = pn_data_encode(data, buffer, buffer_size())) ==
> > > PN_OVERFLOW) {
> > >         grow_buffer();
> > >     }
> > >
> > > Admittedly the latter form is much more awkward in those cases where you
> > > don't care about performance, so I'm all for providing something nicer, but
> > > I think a better API change would be to steal a page from the C stdio.h
> > > APIs and have pn_data_encode always return the number of bytes that would
> > > have been written had there been enough space. This allows you to write the
> > > simplified encode as above:
> > >
> > >     if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
> > >     pn_data_encode(data, buffer, buffer_size());
> > >
> > > Or use a more optimal form:
> > >
> > >    ssize_t n = pn_data_encode(data, buffer, buffer_size());
> > >    if (n > buffer_size()) {
> > >        grow_buffer();
> > >        pn_data_encode(data, buffer, buffer_size());
> > >    }
> > >
> > > This makes the slow/convenient form possible, and provides some options
> > > that are a bit less awkward than the loop, but it also makes it very clear
> > > that when you use the slow/convenient form you are incurring roughly twice
> > > the cost of the alternative.
> > >
> > > Normally I wouldn't be overly fussed by something like this, and I realize
> > > what I'm suggesting is a breaking change relative to what you provided, but
> > > based on what profiling we've done in the past, codec is probably the most
> > > significant source of overhead that we add to an application, and exactly
> > > this sort of double encode effect is almost always one of the first things
> > > you hit when you try to optimize. Given this, I think it would be a good
> > > thing if the API accurately reflects the relative cost of the different
> > > styles of use.
> > >
> > > Thoughts?
> > >
> > > --Rafael
> > >
> 
> 



Re: codec changes

Posted by Bozo Dragojevic <bo...@digiverse.si>.
On 7. 04. 15 15.38, Alan Conway wrote:
> On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote:
>> Given the memory overhead of a pn_data_t before encoding, why not have it
>> own an encode buffer? it could get by with exactly that grow_buffer()
>> callback if ownership is the issue .
> I like this idea a lot. Ownership is an issue for me in Go so I would
> want an alloc/free callback but we could have a non-callback version as
> well (simple wrapper for the callback version with default callbacks)
> for users that don't care about ownership - what do others think?
>

[snip]

> Note:
> - alloc called 0 or 1 times, and is passed the exact size needed to encode.

I would hate to have the encoder know a priori the exact size of the
allocation, especially if we want, at some point, to be able to generate
{LIST|MAP|ARRAY}8 encodings.
I think we can aim for an upper bound, but leave out exactness.

Calculating the encoded size for non-composite atoms is easy and cheap, so
it could be moved from encoding time to the pn_data_put_*() mutators. As
long as uuid is part of the atom union, adding another int will not hurt :P
Calculating a conservative size for composites is also cheap.

Maybe pn_data_t could even be made to encode as it goes; it would then not
need the extra buffer *and* copying for variable-sized atoms (strings and
binary), and each atom could be reduced to an offset into the encoded
buffer.
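
A sketch of what tracking a size bound at put time might look like. The
names are invented, and the worst-case widths assume the largest fixed
encodings (e.g. a 1-byte type code plus a 4-byte size prefix for strings):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Running conservative bound on the encoded size, updated by each
   put-style mutator, so no separate sizing pass is needed. */
typedef struct { size_t size_bound; } sizing_data;

static void put_int(sizing_data *d, int32_t v) {
    (void)v;
    d->size_bound += 1 + 4;              /* type code + 4-byte int body */
}

static void put_string(sizing_data *d, const char *s) {
    d->size_bound += 1 + 4 + strlen(s);  /* code + 4-byte size + bytes */
}
```

The bound over-counts whenever a compact encoding (e.g. SMALLINT or STR8)
would do, which is exactly the "upper bound, not exact" trade-off above.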


> - user is free to allocate more (e.g. doubling strategies)
> - free called after alloc, if and only if alloc is called.
> - separate free instead of single realloc call: encode impl decides
> if/how much data to copy from old to new buffer.

Partial copying of encoded buffers is also problematic, because the encoder
has to backtrack to fill in the byte sizes of structured types.
At best we could do a scatter-gather thing, which quickly becomes
another mess, as pn_data_t would then need much more careful arithmetic
for offset and size calculations. I really think we should just offer
exactly realloc() semantics for now.
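
To illustrate the backtracking: a LIST32-style encoding writes its size
prefix only after the elements are encoded, so the encoder must go back and
patch earlier bytes. This toy sketch (invented helper names; big-endian size
as in AMQP) shows why the whole encoded region has to stay addressable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit big-endian value at an arbitrary earlier offset. */
static void patch_be32(char *buf, size_t at, uint32_t v) {
    buf[at + 0] = (char)(v >> 24);
    buf[at + 1] = (char)(v >> 16);
    buf[at + 2] = (char)(v >> 8);
    buf[at + 3] = (char)(v);
}

/* Encode a toy "list": reserve 4 bytes for the size prefix, write the
   elements, then backtrack to fill in the prefix. */
static size_t encode_toy_list(char *buf, const char *payload, size_t n) {
    size_t size_at = 0;                     /* placeholder position */
    size_t pos = 4;                         /* elements follow the prefix */
    memcpy(buf + pos, payload, n);
    pos += n;
    patch_be32(buf, size_at, (uint32_t)n);  /* the backtracking step */
    return pos;
}
```

If the buffer were split across non-contiguous chunks, `patch_be32` would
need the offset-translation machinery Bozo is objecting to.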


Bozzo



Re: codec changes

Posted by Alan Conway <ac...@redhat.com>.
On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote:
> Given the memory overhead of a pn_data_t before encoding, why not have it
> own an encode buffer? it could get by with exactly that grow_buffer()
> callback if ownership is the issue .

I like this idea a lot. Ownership is an issue for me in Go so I would
want an alloc/free callback but we could have a non-callback version as
well (simple wrapper for the callback version with default callbacks)
for users that don't care about ownership - what do others think?
Something like:

// If allocfn or freefn is NULL, or allocfn returns NULL, return PN_EOS when out of space.
ssize_t pn_data_encode_alloc(pn_data_t *data, char *bytes, size_t size,
  void*(*allocfn)(size_t), void (*freefn)(void*));

// Compatibility
ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size) {
  return pn_data_encode_alloc(data, bytes, size, NULL, NULL);
}

// Convenience when the caller would like encode to do buffer management for them.
// If *bytes is not NULL it is the initial buffer; if NULL, encode allocates.
// *bytes is updated to point to the final buffer, which must be freed with free().
ssize_t pn_data_encode_grow(pn_data_t *data, char **bytes, size_t size) {
  return pn_data_encode_alloc(data, *bytes, size, pni_realloc, free);
}

Note:
- alloc called 0 or 1 times, and is passed the exact size needed to encode.
- user is free to allocate more (e.g. doubling strategies)
- free called after alloc, if and only if alloc is called.
- separate free instead of single realloc call: encode impl decides
if/how much data to copy from old to new buffer.

This can probably be made a bit simpler.

> Bozzo
> On Mar 31, 2015 6:10 PM, "Rafael Schloming" <rh...@alum.mit.edu> wrote:
> 
> > Hi Alan,
> >
> > Sorry I didn't comment on this sooner, I didn't have time to comment on
> > your original review request during my travels, however I do have some
> > thoughts on the changes you made to the codec interface. I noticed you
> > added a separate accessor for the size:
> >
> >     ssize_t pn_data_encoded_size(pn_data_t *data);
> >
> > This is alongside the original encode method:
> >
> >     ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);
> >
> > I think this API choice while nice in that it is backwards compatible is
> > also going to result in code that is roughly twice as slow as it needs to
> > be in the most common case. Based on my experience implementing and
> > profiling codec in C, Python, and Java, computing the size of the encoded
> > data seems to usually be roughly the same amount of work as actually
> > encoding it regardless of the implementation language. Therefore code like
> > this:
> >
> >     if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
> >     pn_data_encode(data, buffer, buffer_size());
> >
> > Can end up being roughly twice as slow as code like this:
> >
> >     ssize_t err;
> >     while ((err = pn_data_encode(data, buffer, buffer_size())) ==
> > PN_OVERFLOW) {
> >         grow_buffer();
> >     }
> >
> > Admittedly the latter form is much more awkward in those cases where you
> > don't care about performance, so I'm all for providing something nicer, but
> > I think a better API change would be to steal a page from the C stdio.h
> > APIs and have pn_data_encode always return the number of bytes that would
> > have been written had there been enough space. This allows you to write the
> > simplified encode as above:
> >
> >     if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
> >     pn_data_encode(data, buffer, buffer_size());
> >
> > Or use a more optimal form:
> >
> >    ssize_t n = pn_data_encode(data, buffer, buffer_size());
> >    if (n > buffer_size()) {
> >        grow_buffer();
> >        pn_data_encode(data, buffer, buffer_size());
> >    }
> >
> > This makes the slow/convenient form possible, and provides some options
> > that are a bit less awkward than the loop, but it also makes it very clear
> > that when you use the slow/convenient form you are incurring roughly twice
> > the cost of the alternative.
> >
> > Normally I wouldn't be overly fussed by something like this, and I realize
> > what I'm suggesting is a breaking change relative to what you provided, but
> > based on what profiling we've done in the past, codec is probably the most
> > significant source of overhead that we add to an application, and exactly
> > this sort of double encode effect is almost always one of the first things
> > you hit when you try to optimize. Given this, I think it would be a good
> > thing if the API accurately reflects the relative cost of the different
> > styles of use.
> >
> > Thoughts?
> >
> > --Rafael
> >



Re: codec changes

Posted by Božo Dragojevič <bo...@digiverse.si>.
Given the memory overhead of a pn_data_t before encoding, why not have it
own an encode buffer? It could get by with exactly that grow_buffer()
callback, if ownership is the issue.

Bozzo
On Mar 31, 2015 6:10 PM, "Rafael Schloming" <rh...@alum.mit.edu> wrote:

> Hi Alan,
>
> Sorry I didn't comment on this sooner, I didn't have time to comment on
> your original review request during my travels, however I do have some
> thoughts on the changes you made to the codec interface. I noticed you
> added a separate accessor for the size:
>
>     ssize_t pn_data_encoded_size(pn_data_t *data);
>
> This is alongside the original encode method:
>
>     ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);
>
> I think this API choice while nice in that it is backwards compatible is
> also going to result in code that is roughly twice as slow as it needs to
> be in the most common case. Based on my experience implementing and
> profiling codec in C, Python, and Java, computing the size of the encoded
> data seems to usually be roughly the same amount of work as actually
> encoding it regardless of the implementation language. Therefore code like
> this:
>
>     if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
>     pn_data_encode(data, buffer, buffer_size());
>
> Can end up being roughly twice as slow as code like this:
>
>     ssize_t err;
>     while ((err = pn_data_encode(data, buffer, buffer_size())) ==
> PN_OVERFLOW) {
>         grow_buffer();
>     }
>
> Admittedly the latter form is much more awkward in those cases where you
> don't care about performance, so I'm all for providing something nicer, but
> I think a better API change would be to steal a page from the C stdio.h
> APIs and have pn_data_encode always return the number of bytes that would
> have been written had there been enough space. This allows you to write the
> simplified encode as above:
>
>     if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
>     pn_data_encode(data, buffer, buffer_size());
>
> Or use a more optimal form:
>
>    ssize_t n = pn_data_encode(data, buffer, buffer_size());
>    if (n > buffer_size()) {
>        grow_buffer();
>        pn_data_encode(data, buffer, buffer_size());
>    }
>
> This makes the slow/convenient form possible, and provides some options
> that are a bit less awkward than the loop, but it also makes it very clear
> that when you use the slow/convenient form you are incurring roughly twice
> the cost of the alternative.
>
> Normally I wouldn't be overly fussed by something like this, and I realize
> what I'm suggesting is a breaking change relative to what you provided, but
> based on what profiling we've done in the past, codec is probably the most
> significant source of overhead that we add to an application, and exactly
> this sort of double encode effect is almost always one of the first things
> you hit when you try to optimize. Given this, I think it would be a good
> thing if the API accurately reflects the relative cost of the different
> styles of use.
>
> Thoughts?
>
> --Rafael
>