You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Todd Lipcon <to...@cloudera.com.INVALID> on 2020/04/06 16:18:18 UTC

C interface clarifications

Hey folks,

I've started working on a patch to make Apache Kudu's C++ client able to
expose batches of data in Arrow's new C-style interface (
https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
)

I had a couple questions / items that should be clarified in the spec. Wes
suggested I raise them here on dev@:

*1) Should producers expect callers to zero-init structs?*

The spec suggests that producers have an interface like:

Status Produce(ArrowArray* array) {
  ...
}

In the case of Arrow's own producer implementation, it doesnt assume that
'array' has been initialized in any way prior to this call, and the first
thing it does is zero the memory of 'array'. This is pretty standard
behavior in C-style APIs (eg stat(2) doesn't assume that its out-argument
is initialized in any way)

 An alternate approach would be to assume that 'array' is in some valid
state, and c all array->release() if it is non-null prior to filling in the
array with new data. This is a more C++-style API: in C++ it's rare to have
uninitialized structures floating around because constructors usually put
objects into some kind of valid state before the object gets passed
anywhere.

The answer here is probably "up to you", but might be good to have some
guidance here in the spec doc. I suppose since it's the "C interface" it's
probably best to follow the C-style "producer assumes the argument contains
uninitialized memory" convention.

*2) Clarify lifetime semantics for nested structures*

In my application, i'd like to allocate all of the children structures of
an ArrowSchema or ArrowArray out of a memory pool which is stored in the
private_data field of the top-level struct. As such, my initial
implementation was to make the 'release' callback on the top-level struct
delete that memory pool, and set the 'release' callback of all children
structs to null, since their memory was totally owned by the top-level
struct.

I figured this approach was OK because the spec says:

>  Consumers MUST call a base structure's release callback when they won't
be using it anymore, but they MUST not call any of its children's release
callbacks (including the optional dictionary). The producer is responsible
for releasing the children.

That advice seems to indicate that I can do whatever I want with the
release callback of the children, including not setting it.

However, I found that arrow's ImportArray function would fail a check
because the child structures had no release callbacks set. I had to set the
release callbacks to a no-op function to work around this.

This section of the spec also seems to be a bit in conflict with the
following:

> It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child array
anymore. This satisfies the use case of keeping only a subset of child
arrays, while releasing the others.

... because if you have a parent array which owns the memory referred to by
the child, then moving the child (with a no-op release callback) followed
by releasing the parent, you'll end up with an invalid or deallocated child
as well.

In other words, I think the spec should be explicit that either:
(a) every allocated structure should "stand alone" and be individually
releasable (and thus moveable)
(b) a produced struct must have the same lifetime as all children.
Consumers should not release children, and if they release the original
base, all children are invalidated regardless of whether they have been
moved.


Thanks
Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: C interface clarifications

Posted by Wes McKinney <we...@gmail.com>.

I opened a JIRA to track a potential change or at least clarification
about this use case. One major use case for the C interface will be in
database clients (e.g. this question arose out of using the C
interface for Kudu -- a database) and this may be a common question.

https://issues.apache.org/jira/browse/ARROW-8368

On Tue, Apr 7, 2020 at 12:46 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 07/04/2020 à 19:39, Wes McKinney a écrit :
> >
> > Re-orienting the discussion on something more concrete, suppose that an
> > ArrowArray is used to convey a result set from a database query, and
> > suppose that the resources associated with each column in the result set
> > are independent of the other columns. Should it not be possible to select a
> > subset of the columns and release the resources of the columns that are not
> > needed?
>
> Perhaps.  Though you should probably write the query so as to only
> select those columns in the first place ("SELECT *" is a well-known
> anti-pattern).
>
> That said, if there's a real need, I'm open to relaxing the
> specification a bit, such that you may move one or several children
> before immediately releasing the parent...
>
> Regards
>
> Antoine.

Re: C interface clarifications

Posted by Antoine Pitrou <an...@python.org>.

Le 07/04/2020 à 19:39, Wes McKinney a écrit :
> 
> Re-orienting the discussion on something more concrete, suppose that an
> ArrowArray is used to convey a result set from a database query, and
> suppose that the resources associated with each column in the result set
> are independent of the other columns. Should it not be possible to select a
> subset of the columns and release the resources of the columns that are not
> needed?

Perhaps.  Though you should probably write the query so as to only
select those columns in the first place ("SELECT *" is a well-known
anti-pattern).

That said, if there's a real need, I'm open to relaxing the
specification a bit, such that you may move one or several children
before immediately releasing the parent...

Regards

Antoine.

Re: C interface clarifications

Posted by Wes McKinney <we...@gmail.com>.

On Tue, Apr 7, 2020, 12:04 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 07/04/2020 à 18:49, Todd Lipcon a écrit :
> >>
> >> Hmm, the spec may not be clear enough on this, but if you move a child
> >> and release the parent, then the other children are not usable anymore.
> >>
> >> In your case, you don't call release() on every child.  You just call
> >> release() on the parent once you are done with all children.  So the
> >> synchronization has to be managed on the consumer side, the producer
> >> doesn't see any of it.
> >>
> >
> > I'm not sure I follow. If the consumer takes a top-level ArrowArray, and
> > then moves out each of the children, then calls release() on the
> top-level
> > array, the expectation is that the chlidren stay valid, according to the
> > spec.
>
> According to the spec, once you move a child, you are supposed to
> release the parent ("immediately afterwards").  Then by construction you
> cannot move other children.  I don't know if we want to make it possible
> to move several children before releasing the parent.  Abstractly, that
> may sound desirable, but we should check that doesn't impose further
> constraints on the producer.
>

Re-orienting the discussion on something more concrete, suppose that an
ArrowArray is used to convey a result set from a database query, and
suppose that the resources associated with each column in the result set
are independent of the other columns. Should it not be possible to select a
subset of the columns and release the resources of the columns that are not
needed?


> However, if as a consumer you're simply processing all children, then
> there's no need to move them.  Just release the parent when you are
> finished with all children.
>
> Regards
>
> Antoine.
>

Re: C interface clarifications

Posted by Antoine Pitrou <an...@python.org>.

Le 07/04/2020 à 18:49, Todd Lipcon a écrit :
>>
>> Hmm, the spec may not be clear enough on this, but if you move a child
>> and release the parent, then the other children are not usable anymore.
>>
>> In your case, you don't call release() on every child.  You just call
>> release() on the parent once you are done with all children.  So the
>> synchronization has to be managed on the consumer side, the producer
>> doesn't see any of it.
>>
> 
> I'm not sure I follow. If the consumer takes a top-level ArrowArray, and
> then moves out each of the children, then calls release() on the top-level
> array, the expectation is that the chlidren stay valid, according to the
> spec.

According to the spec, once you move a child, you are supposed to
release the parent ("immediately afterwards").  Then by construction you
cannot move other children.  I don't know if we want to make it possible
to move several children before releasing the parent.  Abstractly, that
may sound desirable, but we should check that doesn't impose further
constraints on the producer.

However, if as a consumer you're simply processing all children, then
there's no need to move them.  Just release the parent when you are
finished with all children.

Regards

Antoine.

Re: C interface clarifications

Posted by Todd Lipcon <to...@cloudera.com.INVALID>.

On Tue, Apr 7, 2020 at 2:40 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 06/04/2020 à 19:22, Todd Lipcon a écrit :
> >
> > The spec should also probably cover thread-safety: if the consumer gets
> an
> > ArrowArray, is it safe to pass off the children to multiple threads and
> > have them call release() concurrently? In other words, do I need to use a
> > thread-safe reference count? I would guess so.
>
> Hmm, the spec may not be clear enough on this, but if you move a child
> and release the parent, then the other children are not usable anymore.
>
> In your case, you don't call release() on every child.  You just call
> release() on the parent once you are done with all children.  So the
> synchronization has to be managed on the consumer side, the producer
> doesn't see any of it.
>

I'm not sure I follow. If the consumer takes a top-level ArrowArray, and
then moves out each of the children, then calls release() on the top-level
array, the expectation is that the chlidren stay valid, according to the
spec. Then the consumer is responsible to call release() on each of the
children, eventually.

The question here is whether the release() callback on different children
needs to be thread-safe (eg by using a shared_ptr and not a non-atomic
reference count). Again it sounds like yes, even though in the most common
use case, everything will happen on a single thread.

-Todd

> (and of course, the simplest way to do that in C++ may be to use a
> shared_ptr to an object holding the ArrowArray struct)
>

Yea, I'm doing more or less this, but using a manual atomic reference count
so that I can do a single atomic operation in the case that release() is
called on the parent, rather than doing O(n_children):
https://gist.github.com/toddlipcon/5d5c6629dd29bc6975dd509e75424454

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: C interface clarifications

Posted by Antoine Pitrou <an...@python.org>.

Le 06/04/2020 à 19:22, Todd Lipcon a écrit :
> 
> The spec should also probably cover thread-safety: if the consumer gets an
> ArrowArray, is it safe to pass off the children to multiple threads and
> have them call release() concurrently? In other words, do I need to use a
> thread-safe reference count? I would guess so.

Hmm, the spec may not be clear enough on this, but if you move a child
and release the parent, then the other children are not usable anymore.

In your case, you don't call release() on every child.  You just call
release() on the parent once you are done with all children.  So the
synchronization has to be managed on the consumer side, the producer
doesn't see any of it.
(and of course, the simplest way to do that in C++ may be to use a
shared_ptr to an object holding the ArrowArray struct)

Regards

Antoine.

Re: C interface clarifications

Posted by Wes McKinney <we...@gmail.com>.

On Mon, Apr 6, 2020 at 12:22 PM Todd Lipcon <to...@cloudera.com.invalid> wrote:
>
> On Mon, Apr 6, 2020 at 9:57 AM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Hello Todd,
> >
> > Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> > >
> > > I had a couple questions / items that should be clarified in the spec.
> > Wes
> > > suggested I raise them here on dev@:
> > >
> > > *1) Should producers expect callers to zero-init structs?*
> >
> > IMO, they shouldn't.  They should fill the structure exhaustively.
> > Though ultimately it's a decision made by the implementer of the
> > producer API.
> >
> > > I suppose since it's the "C interface" it's
> > > probably best to follow the C-style "producer assumes the argument
> > contains
> > > uninitialized memory" convention.
> >
> > Exactly.
> >
> > > *2) Clarify lifetime semantics for nested structures*
> > >
> > > In my application, i'd like to allocate all of the children structures of
> > > an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> > > private_data field of the top-level struct. As such, my initial
> > > implementation was to make the 'release' callback on the top-level struct
> > > delete that memory pool, and set the 'release' callback of all children
> > > structs to null, since their memory was totally owned by the top-level
> > > struct.
> > >
> > > I figured this approach was OK because the spec says:
> > >
> > >>  Consumers MUST call a base structure's release callback when they won't
> > > be using it anymore, but they MUST not call any of its children's release
> > > callbacks (including the optional dictionary). The producer is
> > responsible
> > > for releasing the children.
> > >
> > > That advice seems to indicate that I can do whatever I want with the
> > > release callback of the children, including not setting it.
> >
> > ... Except that in this case, moving a child wouldn't be possible.
> >
> > > This section of the spec also seems to be a bit in conflict with the
> > > following:
> > >
> > >> It is possible to move a child array, but the parent array MUST be
> > > released immediately afterwards, as it won't point to a valid child array
> > > anymore. This satisfies the use case of keeping only a subset of child
> > > arrays, while releasing the others.
> > >
> > > ... because if you have a parent array which owns the memory referred to
> > by
> > > the child, then moving the child (with a no-op release callback) followed
> > > by releasing the parent, you'll end up with an invalid or deallocated
> > child
> > > as well.
> >
> > I think the solution here is for your memory pool to be
> > reference-counted, and have each release callback in the array tree
> > decrement the reference count.  Does that sound reasonable to you?
> >
>
> Sure, I can do that. But I imagine consumers may also be a bit surprised by
> this behavior, that releasing the children doesn't actually free up any
> memory.

This should be documented in the producer API, I think, that even if
you move the children out of the parent that the common resource
persists beyond the parents' release() invocation. It may vary between
producer implementations of the C interface.

> The spec should also probably cover thread-safety: if the consumer gets an
> ArrowArray, is it safe to pass off the children to multiple threads and
> have them call release() concurrently? In other words, do I need to use a
> thread-safe reference count? I would guess so.

Yes, e.g. std::shared_ptr or similar. I agree that the spec should
address or at least provide a strong recommendation regarding
thread-safety of resources shared by distinct ArrowArray structures in
their private_data. While in many scenarios the top-level release
callback will handle destruction and any related thread-safety issues,
as soon as any children are moved out of the parent, their release
callbacks could be called at many time.

So in the spec it says

"It is possible to move a child array, but the parent array MUST be
released immediately afterwards, as it won't point to a valid child
array anymore. This satisfies the use case of keeping only a subset of
child arrays, while releasing the others."

we should add a sentence like "It is recommended that producers take
thread-safety into consideration and ensure that moved child arrays'
release callbacks can be called in a concurrent setting."

> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

Re: C interface clarifications

Posted by Todd Lipcon <to...@cloudera.com.INVALID>.

On Mon, Apr 6, 2020 at 9:57 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hello Todd,
>
> Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> >
> > I had a couple questions / items that should be clarified in the spec.
> Wes
> > suggested I raise them here on dev@:
> >
> > *1) Should producers expect callers to zero-init structs?*
>
> IMO, they shouldn't.  They should fill the structure exhaustively.
> Though ultimately it's a decision made by the implementer of the
> producer API.
>
> > I suppose since it's the "C interface" it's
> > probably best to follow the C-style "producer assumes the argument
> contains
> > uninitialized memory" convention.
>
> Exactly.
>
> > *2) Clarify lifetime semantics for nested structures*
> >
> > In my application, i'd like to allocate all of the children structures of
> > an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> > private_data field of the top-level struct. As such, my initial
> > implementation was to make the 'release' callback on the top-level struct
> > delete that memory pool, and set the 'release' callback of all children
> > structs to null, since their memory was totally owned by the top-level
> > struct.
> >
> > I figured this approach was OK because the spec says:
> >
> >>  Consumers MUST call a base structure's release callback when they won't
> > be using it anymore, but they MUST not call any of its children's release
> > callbacks (including the optional dictionary). The producer is
> responsible
> > for releasing the children.
> >
> > That advice seems to indicate that I can do whatever I want with the
> > release callback of the children, including not setting it.
>
> ... Except that in this case, moving a child wouldn't be possible.
>
> > This section of the spec also seems to be a bit in conflict with the
> > following:
> >
> >> It is possible to move a child array, but the parent array MUST be
> > released immediately afterwards, as it won't point to a valid child array
> > anymore. This satisfies the use case of keeping only a subset of child
> > arrays, while releasing the others.
> >
> > ... because if you have a parent array which owns the memory referred to
> by
> > the child, then moving the child (with a no-op release callback) followed
> > by releasing the parent, you'll end up with an invalid or deallocated
> child
> > as well.
>
> I think the solution here is for your memory pool to be
> reference-counted, and have each release callback in the array tree
> decrement the reference count.  Does that sound reasonable to you?
>

Sure, I can do that. But I imagine consumers may also be a bit surprised by
this behavior, that releasing the children doesn't actually free up any
memory.

The spec should also probably cover thread-safety: if the consumer gets an
ArrowArray, is it safe to pass off the children to multiple threads and
have them call release() concurrently? In other words, do I need to use a
thread-safe reference count? I would guess so.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: C interface clarifications

Posted by Antoine Pitrou <an...@python.org>.

Hello Todd,

Le 06/04/2020 à 18:18, Todd Lipcon a écrit :
> 
> I had a couple questions / items that should be clarified in the spec. Wes
> suggested I raise them here on dev@:
> 
> *1) Should producers expect callers to zero-init structs?*

IMO, they shouldn't.  They should fill the structure exhaustively.
Though ultimately it's a decision made by the implementer of the
producer API.

> I suppose since it's the "C interface" it's
> probably best to follow the C-style "producer assumes the argument contains
> uninitialized memory" convention.

Exactly.

> *2) Clarify lifetime semantics for nested structures*
> 
> In my application, i'd like to allocate all of the children structures of
> an ArrowSchema or ArrowArray out of a memory pool which is stored in the
> private_data field of the top-level struct. As such, my initial
> implementation was to make the 'release' callback on the top-level struct
> delete that memory pool, and set the 'release' callback of all children
> structs to null, since their memory was totally owned by the top-level
> struct.
> 
> I figured this approach was OK because the spec says:
> 
>>  Consumers MUST call a base structure's release callback when they won't
> be using it anymore, but they MUST not call any of its children's release
> callbacks (including the optional dictionary). The producer is responsible
> for releasing the children.
> 
> That advice seems to indicate that I can do whatever I want with the
> release callback of the children, including not setting it.

... Except that in this case, moving a child wouldn't be possible.

> This section of the spec also seems to be a bit in conflict with the
> following:
> 
>> It is possible to move a child array, but the parent array MUST be
> released immediately afterwards, as it won't point to a valid child array
> anymore. This satisfies the use case of keeping only a subset of child
> arrays, while releasing the others.
> 
> ... because if you have a parent array which owns the memory referred to by
> the child, then moving the child (with a no-op release callback) followed
> by releasing the parent, you'll end up with an invalid or deallocated child
> as well.

I think the solution here is for your memory pool to be
reference-counted, and have each release callback in the array tree
decrement the reference count.  Does that sound reasonable to you?

Best regards

Antoine.