Posted to dev@arrow.apache.org by Matthew Topol <mt...@factset.com> on 2021/08/23 14:07:56 UTC

[C++][Go] CGO For Dataset API Integration

Hey All,

So I've been working on a use case where I needed to use the Dataset API from Go. Instead of porting all of it to Go (which would also require porting the Compute side), I created a proof of concept that uses CGO to call into the existing C++ code, much as the Java implementation uses JNI for the same thing. Having proved to myself that it works, I have a question that I figured would be best sent to this mailing list.

CGO just needs a C API to be exposed for it to work. While there is a C Data Interface designed for using Arrow, there is currently no equivalent C Data Interface designed for the Dataset API. So the big question is: if I wanted to contribute this work to the Arrow repo, should a C interface for the Dataset API live in a separate directory and be a separate build artifact, like the JNI interface, or should it be added directly to, and exported from, the Dataset library? It's an organizational question because, either way, the library would need to be present wherever Go code that calls it is built. The difference is between needing only libarrow_dataset.so (and its dependencies) or needing that *and* libarrow_dataset_cgo.so/.a, etc.

I'm curious what everyone's opinions might be on this so I can get an idea of which direction I should go before trying to put a PR together.

Thanks everyone!

--Matt Topol

RE: [C++][Go] CGO For Dataset API Integration

Posted by Matthew Topol <mt...@factset.com>.
That's precisely what I was suggesting and expecting. Sounds good. I'll clean up my POC and make a proper PR sometime soon.

Thanks much for the discussion!

--Matt

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Monday, August 23, 2021 2:18 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration


Then we could provide a small C dataset API somewhere in the C++ source tree (perhaps `arrow/dataset/c/api.h`?).  It would be unstable/experimental and could undergo changes or even removal without notice.

Regards

Antoine.



Re: [C++][Go] CGO For Dataset API Integration

Posted by Antoine Pitrou <an...@python.org>.
Then we could provide a small C dataset API somewhere in the C++ source 
tree (perhaps `arrow/dataset/c/api.h`?).  It would be 
unstable/experimental and could undergo changes or even removal without 
notice.

Regards

Antoine.


On 23/08/2021 at 20:07, Matthew Topol wrote:
> Because Go is always statically compiled for whatever platform you're on at the time, the default behavior when importing Go libraries with `go get` from the command line is to do a git clone of the code and compile it on the fly (because Go's compiler is pretty darn fast), caching statically compiled versions of dependencies for later builds. Go-Arrow follows this same typical Go pattern: you don't distribute a shared object binary; instead, the go tool pulls the code and compiles the relevant parts as needed. Because the import path gives the full path to the go.mod file (the top of the module), it knows that it only needs the go/arrow directory tree for the module, and as such it doesn't clone the entire git repo.

RE: [C++][Go] CGO For Dataset API Integration

Posted by Matthew Topol <mt...@factset.com>.
Because Go is always statically compiled for whatever platform you're on at the time, the default behavior when importing Go libraries with `go get` from the command line is to do a git clone of the code and compile it on the fly (because Go's compiler is pretty darn fast), caching statically compiled versions of dependencies for later builds. Go-Arrow follows this same typical Go pattern: you don't distribute a shared object binary; instead, the go tool pulls the code and compiles the relevant parts as needed. Because the import path gives the full path to the go.mod file (the top of the module), it knows that it only needs the go/arrow directory tree for the module, and as such it doesn't clone the entire git repo.

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Monday, August 23, 2021 2:00 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration



Hmm, I think I'm lacking some context.  Do consumers of Go-Arrow code typically recompile Go-Arrow instead of using an existing binary?


Re: [C++][Go] CGO For Dataset API Integration

Posted by Antoine Pitrou <an...@python.org>.
On 23/08/2021 at 19:53, Matthew Topol wrote:
> The only thing I don't like about it being a private module in the Go implementation is distribution. For native Go code, consumers can just run `go get` and have it work. But this interface would require both consumers of the module and any consumers of those consumers to have a locally built version of this library available when building their Go code. That's easy to link in statically when distributing binaries, but not for library builders.

Hmm, I think I'm lacking some context.  Do consumers of Go-Arrow code 
typically recompile Go-Arrow instead of using an existing binary?


RE: [C++][Go] CGO For Dataset API Integration

Posted by Matthew Topol <mt...@factset.com>.
The only thing I don't like about it being a private module in the Go implementation is distribution. For native Go code, consumers can just run `go get` and have it work. But this interface would require both consumers of the module and any consumers of those consumers to have a locally built version of this library available when building their Go code. That's easy to link in statically when distributing binaries, but not for library builders.

Currently, the Arrow C++ source tree already has everything set up and configured to distribute the build artifacts for the various platforms, which I assume is also why the C++ code for the JNI dataset library is in the C++ source tree (please correct me if I'm wrong). The Go build and deploy scripts have no such deployment step because there is typically no need for one with Go. So even if it's a separate private module, I'd still prefer that it at least live in the cpp source tree (perhaps a cpp/src/cgo directory?) in order to benefit from the existing build and CI tooling for deployment and distribution. That way, as long as the necessary dependency is installed (e.g. "apt install libarrow_dataset_cgo"), `go get github.com/apache/arrow/go/dataset` would work without issue, rather than requiring additional steps from developers.

Unless there's an easy way to grab the C++ code from the Go source tree in this case and add it to the libraries being deployed from the C++ build? I'm not familiar enough with that deployment configuration to know whether it's actually easy to hook into for compiling and deploying a library that isn't in the C++ source tree.

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Monday, August 23, 2021 1:24 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration



I think the dataset C interface can start as a private module in the Go implementation.  If it may be useful to other people then we can consider transferring it into the Arrow C++ source tree.

Regards

Antoine.

Re: [C++][Go] CGO For Dataset API Integration

Posted by Antoine Pitrou <an...@python.org>.
On 23/08/2021 at 19:16, Matthew Topol wrote:
> Unfortunately, Go currently can only integrate with C++ libraries through a C interface. There does exist SWIG which is a generator for creating interface code between Go and C++, but ultimately it's just automating the creation of a C interface and Go glue code. Personally I'm not a fan of the code that SWIG generates and haven't had too much luck with it.
> 
> I have a working POC that uses the Dataset API via CGO through a C interface (basically just passing around a uintptr_t holding the address of a heap-allocated shared_ptr to a DatasetFactory/Dataset/Scanner, and using the C Data Interface to pass the resulting record batches through without copying), but I couldn't decide on the best way to integrate the idea and clean it up into a real PR, hence this email thread. I initially started porting the Dataset API to Go, but ran into the fact that it uses the compute expression classes to define things and perform the filtering, and realized that it wouldn't be a good idea to try porting the entire compute library.
> 
> So it becomes a question of at what level I do the implementation and at what level I call through a C interface into the C++. Then there is the question of whether the interface should be a component separate from the existing dataset/compute libraries that gets linked into the Go code, optionally as a separate module, so that the current Arrow Go implementation takes on no dependency on the C++ libraries; only users of the Dataset API (and potentially the compute library) would need them.

I think the dataset C interface can start as a private module in the Go 
implementation.  If it may be useful to other people then we can 
consider transferring it into the Arrow C++ source tree.

Regards

Antoine.

RE: [C++][Go] CGO For Dataset API Integration

Posted by Matthew Topol <mt...@factset.com>.
Unfortunately, Go currently can only integrate with C++ libraries through a C interface. There does exist SWIG which is a generator for creating interface code between Go and C++, but ultimately it's just automating the creation of a C interface and Go glue code. Personally I'm not a fan of the code that SWIG generates and haven't had too much luck with it. 

I have a working POC that uses the Dataset API via CGO through a C interface (basically just passing around a uintptr_t holding the address of a heap-allocated shared_ptr to a DatasetFactory/Dataset/Scanner, and using the C Data Interface to pass the resulting record batches through without copying), but I couldn't decide on the best way to integrate the idea and clean it up into a real PR, hence this email thread. I initially started porting the Dataset API to Go, but ran into the fact that it uses the compute expression classes to define things and perform the filtering, and realized that it wouldn't be a good idea to try porting the entire compute library.
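
For readers unfamiliar with the handle pattern described above, here is a minimal sketch of it. This is not the POC code: `FakeDataset` and the `fake_dataset_*` functions are hypothetical stand-ins invented for illustration and are not part of Arrow. The only point is to show a heap-allocated shared_ptr whose address crosses a C boundary as a uintptr_t.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a C++ object such as a Dataset or Scanner.
// It is NOT an Arrow class; it only gives the handle something to own.
class FakeDataset {
 public:
  explicit FakeDataset(std::string uri) : uri_(std::move(uri)) {}
  int64_t UriLength() const { return static_cast<int64_t>(uri_.size()); }

 private:
  std::string uri_;
};

using Handle = std::shared_ptr<FakeDataset>;

extern "C" {

// Returns the address of a heap-allocated shared_ptr, or 0 on bad input.
// The caller owns the handle and must pass it to fake_dataset_close().
uintptr_t fake_dataset_open(const char* uri) {
  if (uri == nullptr) return 0;
  auto* handle = new Handle(std::make_shared<FakeDataset>(uri));
  return reinterpret_cast<uintptr_t>(handle);
}

// Recovers the shared_ptr from the integer handle and calls into C++.
int64_t fake_dataset_uri_length(uintptr_t h) {
  assert(h != 0);
  return (*reinterpret_cast<Handle*>(h))->UriLength();
}

// Deletes the heap-allocated shared_ptr, dropping its reference.
void fake_dataset_close(uintptr_t h) {
  delete reinterpret_cast<Handle*>(h);
}

}  // extern "C"
```

On the Go side, CGO would see only the three C functions and an integer, so no C++ type has to be visible to Go. A real wrapper would also need to stop C++ exceptions from escaping through the `extern "C"` functions, which this sketch leaves out.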

So it becomes a question of at what level I do the implementation and at what level I call through a C interface into the C++. Then there is the question of whether the interface should be a component separate from the existing dataset/compute libraries that gets linked into the Go code, optionally as a separate module, so that the current Arrow Go implementation takes on no dependency on the C++ libraries; only users of the Dataset API (and potentially the compute library) would need them.

--Matt

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Monday, August 23, 2021 12:56 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration



That does make sense, though I wonder how usable a C API to datasets would be.  Being able to integrate with the C++ API from Go would probably make more sense.

Regards

Antoine.

Re: [C++][Go] CGO For Dataset API Integration

Posted by Antoine Pitrou <an...@python.org>.
On 23/08/2021 at 18:22, Matthew Topol wrote:
> That's a fair point, and part of the work I've done so far is a local Go implementation of at least consuming the C data interface. It will also eventually involve creating the necessary implementation to produce the C-Data interface too. But specifically I'm asking for opinions on using that C-Data interface to build a C *programming* interface to the C++ Dataset API in the same vein as the JNI interface, so that Go could use the dataset api without having to reimplement the entirety of it.
> 
> Given the difference between a *programming* interface and a *data* interface, I suppose the recommendation would be that creating a C Programming Interface for the Dataset API (using the C-Data interface for producing/consuming the actual Arrow data) should be a separate component like libarrow_dataset_jni rather than integrating it directly into the dataset component. Right?
> 
> If it's not necessary for there to be Go specific things in the interface, then it could just be called *libarrow_dataset_c* or something equivalent, but would still be a separate component which just relies on the dataset api rather than being integrated into it. Does that make sense?

That does make sense, though I wonder how usable a C API to datasets 
would be.  Being able to integrate with the C++ API from Go would 
probably make more sense.

Regards

Antoine.

RE: [C++][Go] CGO For Dataset API Integration

Posted by Matthew Topol <mt...@factset.com>.
That's a fair point, and part of the work I've done so far is a local Go implementation of at least consuming the C data interface. It will also eventually involve creating the necessary implementation to produce the C-Data interface too. But specifically I'm asking for opinions on using that C-Data interface to build a C *programming* interface to the C++ Dataset API in the same vein as the JNI interface, so that Go could use the dataset api without having to reimplement the entirety of it. 

Given the difference between a *programming* interface and a *data* interface, I suppose the recommendation would be that creating a C Programming Interface for the Dataset API (using the C-Data interface for producing/consuming the actual Arrow data) should be a separate component like libarrow_dataset_jni rather than integrating it directly into the dataset component. Right?

If it's not necessary for there to be Go specific things in the interface, then it could just be called *libarrow_dataset_c* or something equivalent, but would still be a separate component which just relies on the dataset api rather than being integrated into it. Does that make sense?

Alternately, I could create a Go implementation of the Dataset API but use CGO to make the necessary calls to the compute/gandiva APIs at that level, instead of at the Dataset API level. I'm trying to find the right balance between maintainability and complexity: reimplementing the entire compute library in Go is certainly not viable long term, since it would then need to be maintained separately from the C++ implementation, rather than just hooking into the C++ implementation directly (which I presume is the motivation for using JNI to do the same, aside from performance).

--Matt

-----Original Message-----
From: Antoine Pitrou <an...@python.org> 
Sent: Monday, August 23, 2021 12:00 PM
To: dev@arrow.apache.org
Subject: Re: [C++][Go] CGO For Dataset API Integration


Hi Matt,

As the name suggests, the C data interface is not a *programming* interface.  It is a data sharing convention which relies on the existence of dedicated endpoints to produce or consume the C data structures.

For example in Arrow C++, there is this set of APIs:
https://arrow.apache.org/docs/cpp/api/c_abi.html#c-data-interface

In PyArrow:
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1267-L1305

In Rust:
https://docs.rs/arrow/5.2.0/arrow/array/trait.Array.html#method.to_raw
https://docs.rs/arrow/5.2.0/arrow/array/fn.make_array_from_raw.html

The first thing to do would be for the Go implementation to implement the C data interface.

Regards

Antoine.




Re: [C++][Go] CGO For Dataset API Integration

Posted by Antoine Pitrou <an...@python.org>.
Hi Matt,

As the name suggests, the C data interface is not a *programming* 
interface.  It is a data sharing convention which relies on the 
existence of dedicated endpoints to produce or consume the C data 
structures.

For example in Arrow C++, there is this set of APIs:
https://arrow.apache.org/docs/cpp/api/c_abi.html#c-data-interface

In PyArrow:
https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1267-L1305

In Rust:
https://docs.rs/arrow/5.2.0/arrow/array/trait.Array.html#method.to_raw
https://docs.rs/arrow/5.2.0/arrow/array/fn.make_array_from_raw.html

The first thing to do would be for the Go implementation to implement 
the C data interface.
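
To make "endpoints to produce or consume the C data structures" concrete, here is a minimal sketch of one producer and one consumer for an int32 array with no nulls. The `ArrowArray` struct layout follows the C data interface specification; `Int32Holder`, `ExportInt32` and `SumAndRelease` are hypothetical names invented for this example and are not Arrow APIs. The companion `ArrowSchema` struct, which carries the type information, is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
// Struct layout as given in the Arrow C data interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif

namespace {

// Producer-side state that must stay alive until release() is called.
struct Int32Holder {
  std::vector<int32_t> values;
  const void* buffers[2];  // [0] validity bitmap, [1] data
};

// Release callback: frees the producer state and marks the struct released.
void ReleaseInt32(ArrowArray* array) {
  delete static_cast<Int32Holder*>(array->private_data);
  array->release = nullptr;
}

}  // namespace

// Hypothetical producer endpoint: fills `out` with a view of `values`.
void ExportInt32(std::vector<int32_t> values, ArrowArray* out) {
  auto* holder = new Int32Holder{std::move(values), {nullptr, nullptr}};
  holder->buffers[1] = holder->values.data();  // no nulls, so no bitmap
  out->length = static_cast<int64_t>(holder->values.size());
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = holder->buffers;
  out->children = nullptr;
  out->dictionary = nullptr;
  out->release = &ReleaseInt32;
  out->private_data = holder;
}

// Hypothetical consumer endpoint: reads the data buffer without copying,
// then hands the memory back by calling the producer's release callback.
int64_t SumAndRelease(ArrowArray* array) {
  assert(array->release != nullptr);
  const auto* data = static_cast<const int32_t*>(array->buffers[1]);
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; ++i) sum += data[array->offset + i];
  array->release(array);
  return sum;
}
```

A Go implementation of the interface would play one of these two roles on its side of the boundary: filling the struct and supplying a release callback when producing, or reading the buffers and calling `release` when consuming.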

Regards

Antoine.


