Posted to dev@arrow.apache.org by Yevgeny Pats <yp...@cloudquery.io> on 2023/03/03 10:53:34 UTC

Extensions Type Interface Discussion

Hey folks,

Hopefully this is the right place to ask. As some background, I'm Yevgeny
Pats <https://www.linkedin.com/in/yevgeny-pats-5973328b/>, Founder @
CloudQuery <https://github.com/cloudquery/cloudquery>. We are very
interested in migrating our protocol and Go type system to Apache Arrow.
Extensions are a critical part for us, so I have some questions about
whether this is a usage problem on my end or something that is not yet
available. I'll give an example for Go here, but I believe the same issue
exists in all libraries/languages.

Here is a public GitHub gist
<https://gist.github.com/yevgenypats/6969e8e598161fc2021612c780bba3eb>.

What are the problems:

- The problems are around the abstraction for extension types. While I
understand that the underlying storage needs to be supported in the library,
we don't have a way for an extension to provide its own builder, which means
the user needs to know how the extension type stores its values inside the
binary storage. This creates a leaky abstraction and the need for various
helper functions like `UUIDToBinary`.
- The other direction is fine, as you can have methods like ToUUID on top of
the extension array. But this creates asymmetry in the abstraction.
- Because we don't control the builder for extensions, this creeps into
other places like json
<https://github.com/apache/arrow/issues/34292#issuecomment-1446653210> and
csv, where we can't control marshalling (in the same way it is controlled
for all built-in types). So for extensions that use a binary type as
underlying storage, values in json and csv will always be encoded as
base64, which is not very useful (think about uuid, ip address, mac address).
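
To make the asymmetry concrete, here is a minimal, standard-library-only sketch. It does not use the Arrow Go API; the `UUID` type and the `storageJSON` and `extensionJSON` helpers are hypothetical names chosen for illustration, and `UUIDToBinary` only mirrors the kind of helper mentioned above. It relies on one fact about Go: `encoding/json` encodes a `[]byte` as a base64 string, which is what a reader sees when a marshaller only knows about the binary storage.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// UUID is a hypothetical 16-byte value type used only for this illustration.
type UUID [16]byte

// String renders the canonical 8-4-4-4-12 hex form.
func (u UUID) String() string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// UUIDToBinary is the kind of helper a caller needs when the builder only
// accepts the storage representation: it exposes the fact that values are
// stored as raw 16-byte binary.
func UUIDToBinary(u UUID) []byte {
	return u[:]
}

// storageJSON mimics a marshaller that only sees the binary storage:
// encoding/json encodes a []byte as a base64 string.
func storageJSON(u UUID) (string, error) {
	b, err := json.Marshal(UUIDToBinary(u))
	return string(b), err
}

// extensionJSON is what a type-aware marshaller could emit instead.
func extensionJSON(u UUID) (string, error) {
	b, err := json.Marshal(u.String())
	return string(b), err
}

func main() {
	u := UUID{0x12, 0x3e, 0x45, 0x67, 0xe8, 0x9b, 0x12, 0xd3,
		0xa4, 0x56, 0x42, 0x66, 0x14, 0x17, 0x40, 0x00}
	raw, _ := storageJSON(u)
	typed, _ := extensionJSON(u)
	fmt.Println("storage view:  ", raw)   // base64 text
	fmt.Println("extension view:", typed) // hyphenated hex text
}
```

Both outputs describe the same 16 bytes; only the second is something a person would recognise as a UUID in a json or csv file.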

The main point is that I think the right abstraction for extensions should
provide all the APIs (type, array, builder) just like built-in types;
otherwise the abstraction is incomplete or "leaky". Of course we can still
have limitations, such as the custom builder having to use a known
underlying storage (for it to work over IPC), but it can still control
various other aspects like marshaling, unmarshaling, building, and so on.

Hopefully this gives enough context, but I would be happy to elaborate.

Thanks,
Yevgeny

Re: Extensions Type Interface Discussion

Posted by Yevgeny Pats <yp...@cloudquery.io>.
Hi Andrew, Thanks for the reply!

I did consider exactly that first, to see whether we could start by handling
it only at the application level, but that's a no-go for migrating to Arrow
(from our own type system). It removes a lot of the benefits, such as the
built-in csv writer, parquet and a bunch of other things that we would need
to implement on our own. It would also create a suboptimal experience (worse
than the current one, hence we can't migrate) for us and anyone building
cloudquery plugins with our SDK.

I created a PR <https://github.com/apache/arrow/pull/34454> for the Go
implementation already with an example of how we intended
<https://github.com/cloudquery/filetypes/tree/main/internal/cqarrow> to use
it.

I already got great feedback from Matt Topol. Any more feedback and ideas are
welcome. If this abstraction works well, I think other languages might
benefit from it (though right now we only use Go).

On Mon, Mar 6, 2023 at 2:08 PM Andrew Lamb <al...@influxdata.com> wrote:

> Hi Yevgeny,
>
> It is great you are thinking of using Arrow.
>
> > - The problems are around the abstraction for extension types. While I
> > understand that the underlying storage needs to be supported in the
> > library, we don't have a way for an extension to provide its own builder,
> > which means the user needs to know how the extension type stores its
> > values inside the binary storage. This creates a leaky abstraction and
> > the need for various helper functions like `UUIDToBinary`.
>
> I don't have anything specific to offer in terms of the Go implementation.
>
> However, in terms of helping define a better abstraction, one way you might
> proceed is to forgo using the library support for extension types and
> implement support for your custom types yourself in your application code.
> Once you have figured out the most useful APIs, then perhaps you could
> propose contributing them to the arrow Go implementation.
>
> Andrew

Re: Extensions Type Interface Discussion

Posted by Andrew Lamb <al...@influxdata.com>.
Hi Yevgeny,

It is great you are thinking of using Arrow.

> - The problems are around the abstraction for extension types. While I
> understand that the underlying storage needs to be supported in the library,
> we don't have a way for an extension to provide its own builder, which means
> the user needs to know how the extension type stores its values inside the
> binary storage. This creates a leaky abstraction and the need for various
> helper functions like `UUIDToBinary`.

I don't have anything specific to offer in terms of the Go implementation.

However, in terms of helping define a better abstraction, one way you might
proceed is to forgo using the library support for extension types and
implement support for your custom types yourself in your application code.
Once you have figured out the most useful APIs, then perhaps you could
propose contributing them to the arrow Go implementation.

Andrew