You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Dimitri Vorona <al...@googlemail.com> on 2018/04/11 07:02:52 UTC

Buffer slices are unsafe

Hi everybody,

to continue the discussion in [0]: right now this [1] can happen and the
sliced buffer has no way to foresee or to check against it beforehand.

I'd suggest to create a new class SlicedBuffer, which would reference the
parent buffer and return it's data() pointer, insted of grabbing the
pointer at creation. We can the parent reference a weak pointer, so the
existing SliceBuffers don't prevent the deallocation (or maybe even have a
separate WeakSliceBuffer class).

It would also make the mentioned PR way safer to merge, in my opinion.

Cheers,
Dimitri.

[0]: https://github.com/apache/arrow/pull/1769
[1]: https://gist.github.com/alendit/f759bc12e03d9dd9d72cac90a6334cc5

Re: Buffer slices are unsafe

Posted by Wes McKinney <we...@gmail.com>.
My feeling is that we should advise users of the library that any
slices of a ResizableBuffer become invalid after a call to Resize.

> I was thinking about something like this [0]. The point is, that the slice
> user has no way of knowing if the slice can still be safely used and who
> owns the memory.

You can look at the Buffer parent to see if there is a parent-child
relationship, which at least tells you whether you definitely do _not_
own the memory.

I'm not convinced from this use case that we need to change the way
that the Buffer abstraction works. If there is a need for memory
ownership-nannying, that may be best handled by some other kind of
abstract interface that uses Buffers for its implementation.

- Wes

On Wed, Apr 11, 2018 at 8:05 AM, Antoine Pitrou <an...@python.org> wrote:
>
> Hi Dimitri,
>
> Le 11/04/2018 à 13:42, Dimitri Vorona a écrit :
>>
>> I was thinking about something like this [0]. The point is, that the slice
>> user has no way of knowing if the slice can still be safely used and who
>> owns the memory.
>
> I think the answer is that calling free() on something you exported to
> consumers is incorrect.  If you allocate buffers, you should choose a
> Buffer implementation with proper ownership semantics.  For example, we
> have PoolBuffer, but also Python buffers and CUDA buffers.  They all
> (should) have proper ownership.  If you want to create buffers with data
> managed with malloc/free, you need to write a MallocBuffer implementation.
>
>> A step back is a good idea. My use case would be to return a partially
>> built slice on a buffer, while continuing appending to the buffer. Think
>> delta dictionaries: while a slice of the coding table can be sent, we will
>> have additional data to append later on.
>
> I don't know anything about delta dictionaries, but I get the idea.
>
> Does the implementation become harder if you split the coding table into
> several buffers that never get resized?
>
>> To build on your previous proposal: maybe some more finely grained locking
>> mechanism, like the data_ being a shared_ptr<uint_8*>, slices grabbing a
>> copy of it when they want to use it and releasing it afterwards? The parent
>> would then check the couter of the shared_ptr (similar to the number of
>> slices).
>
> You need an actual lock to avoid race conditions (the parent may find a
> zero shared_ptr counter, but another thread would grab a data pointer
> immediately after).
>
> I wonder if we really want such implementation complexity.  Also,
> everyone is now paying the price of locking.  Ideally slicing and
> fetching a data pointer should be cheap.  I'd like to know what others
> think about this.
>
> Regards
>
> Antoine.

Re: Buffer slices are unsafe

Posted by Antoine Pitrou <an...@python.org>.
Hi Dimitri,

Le 11/04/2018 à 13:42, Dimitri Vorona a écrit :
> 
> I was thinking about something like this [0]. The point is, that the slice
> user has no way of knowing if the slice can still be safely used and who
> owns the memory.

I think the answer is that calling free() on something you exported to
consumers is incorrect.  If you allocate buffers, you should choose a
Buffer implementation with proper ownership semantics.  For example, we
have PoolBuffer, but also Python buffers and CUDA buffers.  They all
(should) have proper ownership.  If you want to create buffers with data
managed with malloc/free, you need to write a MallocBuffer implementation.

> A step back is a good idea. My use case would be to return a partially
> built slice on a buffer, while continuing appending to the buffer. Think
> delta dictionaries: while a slice of the coding table can be sent, we will
> have additional data to append later on.

I don't know anything about delta dictionaries, but I get the idea.

Does the implementation become harder if you split the coding table into
several buffers that never get resized?

> To build on your previous proposal: maybe some more finely grained locking
> mechanism, like the data_ being a shared_ptr<uint_8*>, slices grabbing a
> copy of it when they want to use it and releasing it afterwards? The parent
> would then check the couter of the shared_ptr (similar to the number of
> slices).

You need an actual lock to avoid race conditions (the parent may find a
zero shared_ptr counter, but another thread would grab a data pointer
immediately after).

I wonder if we really want such implementation complexity.  Also,
everyone is now paying the price of locking.  Ideally slicing and
fetching a data pointer should be cheap.  I'd like to know what others
think about this.

Regards

Antoine.

Re: Buffer slices are unsafe

Posted by Dimitri Vorona <al...@googlemail.com>.
Hi Antoine,

I was thinking about something like this [0]. The point is, that the slice
user has no way of knowing if the slice can still be safely used and who
owns the memory.

You are right, of course, that it wouldn't be thread safe and we'd need a
locking mechanism which prevents de-/reallocations while there is an active
slice.

A step back is a good idea. My use case would be to return a partially
built slice on a buffer, while continuing appending to the buffer. Think
delta dictionaries: while a slice of the coding table can be sent, we will
have additional data to append later on. Right now it is handled by two
buffers with copying between them, it could be implemented more efficiently
with slices with a clear concept of ownership, I think.

To build on your previous proposal: maybe some more finely grained locking
mechanism, like the data_ being a shared_ptr<uint_8*>, slices grabbing a
copy of it when they want to use it and releasing it afterwards? The parent
would then check the couter of the shared_ptr (similar to the number of
slices). That way an inactive slice wouldn't block the parent from
resizing.

Cheers,
Dimitri.

[0]: https://gist.github.com/alendit/00e4bd5f0e8a79e8ff9ebd86995f3905

On Wed, Apr 11, 2018 at 12:35 PM, Antoine Pitrou <an...@python.org> wrote:

>
> Hi Dimitri,
>
> Le 11/04/2018 à 12:28, Dimitri Vorona a écrit :
> >
> > I think, it comes down to the memory ownership. While Buffer apparently
> > never owns it's memory (based on the doc string), a MutableBuffer can. So
> > if you slice a MutableBuffer, and the memory gets deallocated, you've got
> > the same problem.
>
> I may be misunderstanding, but that sounds unlikely.  Can you post a
> concrete example?
>
> > Your proposed solution seems to be a safe one, we could even use
> > shared_ptr::use_counts. But I'd rather invert the responsibility: a
> parent
> > buffer can do with it's memory as it pleases, the slices get a way of
> > checking if they are still valid.
>
> That's inherently incorrect with multiple threads.
>
> I think we should step back a bit: what's the use case, and what's the
> minimal way to satisfy that use case without opening a whole can of worms?
>
> Regards
>
> Antoine.
>

Re: Buffer slices are unsafe

Posted by Antoine Pitrou <an...@python.org>.
Hi Dimitri,

Le 11/04/2018 à 12:28, Dimitri Vorona a écrit :
> 
> I think, it comes down to the memory ownership. While Buffer apparently
> never owns it's memory (based on the doc string), a MutableBuffer can. So
> if you slice a MutableBuffer, and the memory gets deallocated, you've got
> the same problem.

I may be misunderstanding, but that sounds unlikely.  Can you post a
concrete example?

> Your proposed solution seems to be a safe one, we could even use
> shared_ptr::use_counts. But I'd rather invert the responsibility: a parent
> buffer can do with it's memory as it pleases, the slices get a way of
> checking if they are still valid.

That's inherently incorrect with multiple threads.

I think we should step back a bit: what's the use case, and what's the
minimal way to satisfy that use case without opening a whole can of worms?

Regards

Antoine.

Re: Buffer slices are unsafe

Posted by Dimitri Vorona <al...@googlemail.com>.
Hi Antoine,

> AFAIU, the problem only exists with ResizableBuffer?

I think, it comes down to the memory ownership. While Buffer apparently
never owns it's memory (based on the doc string), a MutableBuffer can. So
if you slice a MutableBuffer, and the memory gets deallocated, you've got
the same problem.

Your proposed solution seems to be a safe one, we could even use
shared_ptr::use_counts. But I'd rather invert the responsibility: a parent
buffer can do with it's memory as it pleases, the slices get a way of
checking if they are still valid. Maybe we could even include a safe cast
to void*, uint8_t* and char* for the buffers (which checks buffers
validity), so to discourage reusing of a stale data pointer (like the
raw_data* in some ArrayBuilders). So instead of memcpy(buffer.data(),
source, N) you'd just use memcpy(buffer, source, N).

So the logic would basically be: a Buffer either owns it's memory or is a
slice (view). If it's a slice, then it has a way of checking if it's still
valid (and hasn't shrunk so the view is not valid anymore). If the slice is
built with a raw pointer, then all bets are off, but we should mark such
cases as super unsafe, i.e. UnsafeBuffer.

Maybe I'm overthinking it, but I can imagine, that it'll come to bite us,
when the code base grows.

Cheers,
Dimitri.



On Wed, Apr 11, 2018 at 11:19 AM, Antoine Pitrou <an...@python.org> wrote:

>
> Hi Dimitri,
>
> Le 11/04/2018 à 09:02, Dimitri Vorona a écrit :
> > Hi everybody,
> >
> > to continue the discussion in [0]: right now this [1] can happen and the
> > sliced buffer has no way to foresee or to check against it beforehand.
> >
> > I'd suggest to create a new class SlicedBuffer, which would reference the
> > parent buffer and return it's data() pointer, insted of grabbing the
> > pointer at creation. We can the parent reference a weak pointer, so the
> > existing SliceBuffers don't prevent the deallocation (or maybe even have
> a
> > separate WeakSliceBuffer class).
>
> AFAIU, the problem only exists with ResizableBuffer?
>
> We could do the following:
>
> - add a new Buffer member counting the number of exports (default 0)
>
> - in SliceBuffer, increment the parent's number of exports
> - in ~Buffer, decrement the parent's number of exports
> - in PoolBuffer::Resize(), return an error if the buffer has any exports
>
> (this is how Python buffers work, for the record)
>
> What do you think?
>
> Regards
>
> Antoine.
>

Re: Buffer slices are unsafe

Posted by Antoine Pitrou <an...@python.org>.
Hi Dimitri,

Le 11/04/2018 à 09:02, Dimitri Vorona a écrit :
> Hi everybody,
> 
> to continue the discussion in [0]: right now this [1] can happen and the
> sliced buffer has no way to foresee or to check against it beforehand.
> 
> I'd suggest to create a new class SlicedBuffer, which would reference the
> parent buffer and return it's data() pointer, insted of grabbing the
> pointer at creation. We can the parent reference a weak pointer, so the
> existing SliceBuffers don't prevent the deallocation (or maybe even have a
> separate WeakSliceBuffer class).

AFAIU, the problem only exists with ResizableBuffer?

We could do the following:

- add a new Buffer member counting the number of exports (default 0)

- in SliceBuffer, increment the parent's number of exports
- in ~Buffer, decrement the parent's number of exports
- in PoolBuffer::Resize(), return an error if the buffer has any exports

(this is how Python buffers work, for the record)

What do you think?

Regards

Antoine.