Posted to dev@arrow.apache.org by vertexclique vertexclique <ve...@gmail.com> on 2020/10/08 13:10:10 UTC

[Rust]: Exposed API

Hi;

Let me start with my aim and how my thinking has evolved.
Through extensive use of the Arrow API, I've realized that we are doing
many unnecessary allocations and rebuilds for simple things like offset
changes. (At least that's what I am doing.)

That said, it is hard to justify the iterator overhead of reconstruction,
along with the other extra work that comes with ArrayData and Array
construction. I see that the tests are also very long because of the
reconstruction of intermediate results.

Use case 1: the code below has no effect on child_data:

        std::mem::swap(&mut child_data.offset(), &mut 40);

Because the fields are private, even a simple operation like the one above,
which would enable advanced use cases, is blocked.
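
To make the no-op concrete, here is a minimal sketch. `ChildData` and
`swap_through_getter` are hypothetical stand-ins, not Arrow types, and the
sketch assumes `offset()` returns a `usize` by value. Under that assumption
the swap operates on a temporary copy and the stored offset never changes.

```rust
/// Hypothetical stand-in for the real struct; only the offset is modeled.
pub struct ChildData {
    offset: usize,
}

impl ChildData {
    pub fn new(offset: usize) -> Self {
        Self { offset }
    }

    /// Assumption: the getter returns the offset by value (a copy).
    pub fn offset(&self) -> usize {
        self.offset
    }
}

/// Mirrors the snippet above: 40 is swapped into a temporary that holds a
/// copy of the offset, and that temporary is dropped at the end of the
/// statement. The struct itself is never written to.
pub fn swap_through_getter(child: &ChildData) -> usize {
    std::mem::swap(&mut child.offset(), &mut 40);
    child.offset()
}

fn main() {
    let child = ChildData::new(0);
    // The stored offset is still 0 after the swap.
    println!("offset after swap: {}", swap_through_getter(&child));
}
```

Note that the function only needs `&ChildData`: nothing in it mutates the
struct, which is exactly the problem.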

I propose the following:

A feature-gate macro would expose the fields so that this becomes possible:

        std::mem::swap(&mut child_data.offset, &mut 40);

The macro will check for a feature called `exposed` and conditionally
compile the fields as public.
This can apply to any field. That said, we would put a disclaimer in the
README about the exposed API, stating that it shouldn't be used unless you
know what you are doing.
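
One way such a macro could look is sketched below. This is only an
illustration under my own assumptions: `exposed_struct!`, `ChildData`, and
`rebase` are hypothetical names, and only the feature name `exposed` comes
from the proposal. The macro emits the struct twice, once with public fields
and once with private ones, and `cfg` picks one at compile time.

```rust
// Assumption: the crate's Cargo.toml declares `exposed = []` under
// `[features]` and leaves it out of `default`, so the gate is off unless
// a user opts in.

/// Hypothetical macro: wraps a struct definition and makes every field
/// `pub` only when the `exposed` feature is enabled.
macro_rules! exposed_struct {
    (
        $(#[$meta:meta])*
        pub struct $name:ident {
            $($field:ident : $ty:ty),* $(,)?
        }
    ) => {
        // Feature on: all fields are public.
        #[cfg(feature = "exposed")]
        $(#[$meta])*
        pub struct $name {
            $(pub $field: $ty),*
        }

        // Feature off (the default): fields stay private.
        #[cfg(not(feature = "exposed"))]
        $(#[$meta])*
        pub struct $name {
            $($field: $ty),*
        }
    };
}

exposed_struct! {
    #[derive(Debug)]
    pub struct ChildData {
        offset: usize,
        len: usize,
    }
}

impl ChildData {
    pub fn new(offset: usize, len: usize) -> Self {
        Self { offset, len }
    }

    pub fn offset(&self) -> usize {
        self.offset
    }

    pub fn len(&self) -> usize {
        self.len
    }
}

/// Changes the offset in place and returns the previous value.
/// This compiles here because it lives in the defining module, where
/// private fields are always reachable. A downstream crate could write
/// the same body only with the `exposed` feature enabled.
pub fn rebase(child: &mut ChildData, mut new_offset: usize) -> usize {
    std::mem::swap(&mut child.offset, &mut new_offset);
    new_offset
}

fn main() {
    let mut child = ChildData::new(0, 8);
    let old = rebase(&mut child, 40);
    println!("offset {} -> {}", old, child.offset());
}
```

Emitting two full definitions keeps the macro simple; the cost is that the
layout and derives must stay identical in both arms.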

An important part of this is that it will enable many things from a
performance perspective, which we can also use internally when the exposed
feature is enabled.

What do you think of it? If you feel good about it, I want to incorporate
this into the codebase asap.

Best,
Mahmut Bulut

Re: [Rust]: Exposed API

Posted by Andrew Lamb <al...@influxdata.com>.
Thanks for the response, Mahmut.

I don't think I have a lot more to add.



Re: [Rust]: Exposed API

Posted by Vertexclique <ve...@gmail.com>.
Hi Andrew,

> I wonder if you can describe at a higher level what you are doing that
> requires so many allocations or rebuilding. The example you provide of
> modifying the underlying offset pointer seems a little strange to me as I
> thought one of the architectural goals of those structures was to be
> immutable.

Sure. The vectorized processing kernels that I am using need to rebuild buffers continuously. Various intermediate arrays are destroyed as soon as I am done with them, so they should not need to be built as arrays in the first place. In my case, immutability shouldn't come with extra cost, and that is exactly how it is for most database systems.

The architecture's goal is immutable structs, that is for sure. But in some cases you don't need immutability for intermediate results. Moreover, allocating immutable data once and then working on it later is the right approach in some cases.


> It might also help to show/explain some examples of what types of
> performance improvements would be enabled.

It is hard to show in a single email, but I will try my best to explain. A few of the optimizations it would enable: doing CAS on processed intermediate results, hoisting operations for vectors of the same type, keeping hot and cold arrays, migrations, and scratch pads.

> Depending on what exactly you are doing, I wonder if you could use the Rust
> unsafe API for your advanced use-cases rather than having to extend arrow
> itself.

To use unsafe, you need to be able to access the pointer. In the example I sent before, nothing enables that. That is what the exposed API is all about: a feature gate that makes buffer fields public at will.

If we don't make them public, people need to copy the codebase and make things public themselves (or transmute), then create new types and add their own methods to do what I have explained. This is much more cumbersome than exposing pointers to the outside behind gates.

Since the exposure will always be feature-gated and disabled by default, it does no harm to the current immutability and encapsulation.

Best,
Mahmut



Re: [Rust]: Exposed API

Posted by Andrew Lamb <al...@influxdata.com>.
Hi Mahmut,

I wonder if you can describe at a higher level what you are doing that
requires so many allocations or rebuilding. The example you provide of
modifying the underlying offset pointer seems a little strange to me as I
thought one of the architectural goals of those structures was to be
immutable.

It might also help to show/explain some examples of what types of
performance improvements would be enabled.

Depending on what exactly you are doing, I wonder if you could use the Rust
unsafe API for your advanced use cases rather than having to extend arrow
itself.

Andrew
