Posted to dev@arrow.apache.org by Radu Teodorescu <ra...@yahoo.com.INVALID> on 2020/06/26 16:33:09 UTC

Deep copy for ArrayData,Array, Table in C++ API

(Lightweight topic this time)
Are there any existing functions for deep copying Array, ArrayData or Table objects in the C++ API?
Ultimately, I am trying to gather a bunch of sparse row ranges into a new contiguous Table. I can see how to copy a Buffer and I could implement it all myself, but I want to make sure I am not reinventing the wheel.

Thank you
Radu

Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 29, 2020 at 9:33 AM Radu Teodorescu
<ra...@yahoo.com.invalid> wrote:
>
> Yes,
> I am set for what I need at the moment, but since I went for a deepish dive into the current API, and this has been a recurring use case over the year, I would like to put forward a few proposals for expanding Take:
> 1. Add support for packed indices - three avenues:
>         a) expand Datum.Kind to allow for PackedIndex: a sequence of individual indices and ranges, as in 3,7,1,[4-10),2,[30-100), which can be represented as an Array<int>: {3,7,1,4,-10,2,30,-100}
>         b) use an additional flag signaling that the index argument (of any type convertible to an int sequence) is in fact a packed index represented as above
>         c) have an explicit convention where, if the type of the index is signed, it is expected to be a packed index

This sounds like a new function altogether, not an expansion of the
existing one. I don't think we should overload the existing
compute::Take function nor add new types to Datum. I am not sure it
needs to fit within the algebra of the kernels framework at the
moment, but this can always be changed later if it makes sense. Much
easier to add things than take them away.

> 2. Add explicit control for result chunk size: since the result of Take is typically (always?) allocated inside the kernel, we can add an argument that specifies the size of each allocated chunk (in bytes or in rows - I lean toward rows), and that can be applied to any Datum type of the values, not only ChunkedArray.

This can be handled by options to the new function.
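
As a rough illustration of how a chunk-size option could live on a separate function rather than on Take itself, here is a minimal standard-library sketch. TakeRanges, TakeRangesOptions and max_chunk_rows are hypothetical names made up for this example, not Arrow APIs, and std::vector stands in for Arrow arrays. Ranges are assumed to be half-open [begin, end).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open row range [begin, end).
using RowRange = std::pair<std::int64_t, std::int64_t>;

// Hypothetical options struct; 0 means "put everything in one chunk".
struct TakeRangesOptions {
  std::int64_t max_chunk_rows = 0;
};

// Hypothetical function: copies the requested ranges of `values` into
// output chunks holding at most `max_chunk_rows` rows each.
template <typename T>
std::vector<std::vector<T>> TakeRanges(const std::vector<T>& values,
                                       const std::vector<RowRange>& ranges,
                                       const TakeRangesOptions& options) {
  assert(options.max_chunk_rows >= 0);
  const std::int64_t limit = options.max_chunk_rows;
  std::vector<std::vector<T>> chunks;
  for (const RowRange& r : ranges) {
    assert(0 <= r.first && r.first <= r.second &&
           r.second <= static_cast<std::int64_t>(values.size()));
    std::int64_t pos = r.first;
    while (pos < r.second) {
      // Start a new chunk if there is none yet or the current one is full.
      if (chunks.empty() ||
          (limit > 0 &&
           static_cast<std::int64_t>(chunks.back().size()) == limit)) {
        chunks.emplace_back();
      }
      // Copy as many rows as fit into the current chunk in one bulk insert.
      std::int64_t n = r.second - pos;
      if (limit > 0) {
        n = std::min(
            n, limit - static_cast<std::int64_t>(chunks.back().size()));
      }
      chunks.back().insert(chunks.back().end(), values.begin() + pos,
                           values.begin() + pos + n);
      pos += n;
    }
  }
  return chunks;
}
```

With max_chunk_rows left at 0 the caller gets a single contiguous chunk; a positive value splits a range across chunk boundaries where needed.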

> What’s the best way to push this forward? Free discussion, votes, tickets? I am happy to work on the actual solution once we agree on one (all of the above should be fairly straightforward).
> Cheers
> Radu

Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Radu Teodorescu <ra...@yahoo.com.INVALID>.
Yes,
I am set for what I need at the moment, but since I went for a deepish dive into the current API, and this has been a recurring use case over the year, I would like to put forward a few proposals for expanding Take:
1. Add support for packed indices - three avenues:
	a) expand Datum.Kind to allow for PackedIndex: a sequence of individual indices and ranges, as in 3,7,1,[4-10),2,[30-100), which can be represented as an Array<int>: {3,7,1,4,-10,2,30,-100}
	b) use an additional flag signaling that the index argument (of any type convertible to an int sequence) is in fact a packed index represented as above
	c) have an explicit convention where, if the type of the index is signed, it is expected to be a packed index

2. Add explicit control for result chunk size: since the result of Take is typically (always?) allocated inside the kernel, we can add an argument that specifies the size of each allocated chunk (in bytes or in rows - I lean toward rows), and that can be applied to any Datum type of the values, not only ChunkedArray.
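
To make the packed-index encoding from item 1 concrete, here is a minimal standard-library sketch of a decoder. It assumes the reading suggested by the example: a non-negative entry is a row index, and a negative entry -e directly after it turns that index into the start of the half-open range [start, e). DecodePackedIndex and RowRange are hypothetical names for illustration only, not Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open row range [begin, end).
using RowRange = std::pair<std::int64_t, std::int64_t>;

// Hypothetical helper: expands a packed index into half-open ranges.
// A single index i becomes the one-row range [i, i + 1).
std::vector<RowRange> DecodePackedIndex(
    const std::vector<std::int64_t>& packed) {
  std::vector<RowRange> ranges;
  for (std::size_t i = 0; i < packed.size(); ++i) {
    const std::int64_t begin = packed[i];
    // A negative entry is only valid directly after a range start.
    assert(begin >= 0);
    if (i + 1 < packed.size() && packed[i + 1] < 0) {
      const std::int64_t end = -packed[i + 1];
      assert(end > begin);
      ranges.emplace_back(begin, end);
      ++i;  // skip the end marker that was just consumed
    } else {
      ranges.emplace_back(begin, begin + 1);
    }
  }
  return ranges;
}
```

Under this assumed encoding, {3,7,1,4,-10,2,30,-100} decodes to the six ranges [3,4), [7,8), [1,2), [4,10), [2,3), [30,100). A consequence of using the sign as the marker is that a range end of 0 cannot be expressed, which is harmless because a non-empty range always ends at 1 or later.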

What’s the best way to push this forward? Free discussion, votes, tickets? I am happy to work on the actual solution once we agree on one (all of the above should be fairly straightforward).
Cheers
Radu



Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Wes McKinney <we...@gmail.com>.
Efficiently assembling a selection from multiple arrays will require
some care -- our current implementation of Take involving ChunkedArray
arguments is not too efficient, and it will need some rewriting for
efficiency at some point in the future. Using some combination of
Concatenate and Take may yield a working solution but probably not a
computationally optimal one.


Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Antoine Pitrou <so...@pitrou.net>.
On Fri, 26 Jun 2020 13:56:26 -0400
Radu Teodorescu <ra...@yahoo.com.INVALID> wrote:
> Looks like Concatenate is my best bet if I am looking at putting together ranges. It certainly doesn’t look as neatly packaged as Take, but it might be the right tool for this job.

Yes, you could Slice the array and then Concatenate the slices.
Note that slicing will keep the entire buffers alive, not only the
range that's being sliced, so it might be suboptimal if you only
keep a small part of the original values.
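
The lifetime point can be pictured with a small standard-library analogy. This is not how Arrow is implemented: SliceView, MakeSlice and Compact are hypothetical names, and a std::shared_ptr to a std::vector stands in for a reference-counted buffer. The sketch only shows why a view that shares ownership keeps the whole allocation alive, and how copying the viewed rows lets the original be freed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a sliced array: a shared buffer plus offset and length.
struct SliceView {
  std::shared_ptr<const std::vector<int>> buffer;
  std::size_t offset;
  std::size_t length;
};

// Zero-copy slice: shares ownership of the entire buffer.
SliceView MakeSlice(std::shared_ptr<const std::vector<int>> buffer,
                    std::size_t offset, std::size_t length) {
  assert(offset + length <= buffer->size());
  return SliceView{std::move(buffer), offset, length};
}

// Copies only the viewed rows into a fresh buffer, so the result no
// longer references the original allocation.
SliceView Compact(const SliceView& s) {
  auto first = s.buffer->begin() + static_cast<std::ptrdiff_t>(s.offset);
  std::shared_ptr<const std::vector<int>> copy =
      std::make_shared<std::vector<int>>(
          first, first + static_cast<std::ptrdiff_t>(s.length));
  return SliceView{std::move(copy), 0, s.length};
}
```

In this analogy, a five-row slice of a million-row buffer pins all million rows until the slice is dropped, while the compacted copy owns just five.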

Regards

Antoine.



Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Radu Teodorescu <ra...@yahoo.com.INVALID>.
Looks like Concatenate is my best bet if I am looking at putting together ranges. It certainly doesn’t look as neatly packaged as Take, but it might be the right tool for this job.


Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Radu Teodorescu <ra...@yahoo.com.INVALID>.
That is fabulous and pretty much it!
Follow-up questions:
1. Is there any efficient way to refer to ranges? Say I want to take rows 1000-2000 and 4000-5000: it feels unwieldy to have to create an index array of 2000 elements, and the underlying implementation would also be less efficient, having to iterate over those indices and copy elements one by one rather than do two memcpys and be done.
2. Is the chunk structure preserved? The deprecated API suggests that if the input is a ChunkedArray the output is also a ChunkedArray. My hope and expectation was to pick ranges from multiple Array objects and drop them all into a single Array, and I am not sure the Datum version does that.
3. Is it possible to append to an existing Array (or Datum in general) rather than have a new Array/Datum produced by the Take function?
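
The behavior asked about in questions 1 and 3 can be sketched with the standard library alone. AppendRanges and RowRange are hypothetical names, not Arrow APIs, and std::vector stands in for an array of fixed-width values. Each half-open range [begin, end) is appended to an existing output with one bulk insert; for trivially copyable element types, standard library implementations typically lower such an insert to a single memmove.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Half-open row range [begin, end).
using RowRange = std::pair<std::int64_t, std::int64_t>;

// Hypothetical helper: appends the requested ranges of `values` to `out`,
// one bulk copy per range instead of one copy per row index.
template <typename T>
void AppendRanges(const std::vector<T>& values,
                  const std::vector<RowRange>& ranges, std::vector<T>* out) {
  assert(out != nullptr);
  // Validate the ranges and reserve the output space once up front.
  std::int64_t total = 0;
  for (const RowRange& r : ranges) {
    assert(0 <= r.first && r.first <= r.second &&
           r.second <= static_cast<std::int64_t>(values.size()));
    total += r.second - r.first;
  }
  out->reserve(out->size() + static_cast<std::size_t>(total));
  for (const RowRange& r : ranges) {
    out->insert(out->end(), values.begin() + r.first,
                values.begin() + r.second);
  }
}
```

Calling it with the ranges {1000, 2000} and {4000, 5000} appends 2000 rows using two bulk copies, and calling it again with a different source vector keeps appending to the same contiguous output.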

Thanks a lot for the quick response (and for all the awesome work you guys have been putting into this project)
Radu




Re: Deep copy for ArrayData,Array, Table in C++ API

Posted by Micah Kornfield <em...@gmail.com>.
This sounds like the Take kernel?
