You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/25 01:53:57 UTC

[GitHub] [arrow] zeroshade opened a new pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

zeroshade opened a new pull request #10995:
URL: https://github.com/apache/arrow/pull/10995


   This adds a new workflow to build the Dataset Go Module with CGO which first builds the C++ library before building the Go Dataset library and running the tests. This only exposes a basic ability to create a FilesystemDataset and scan it to get record batches. It can then be built upon more in order to add more functionality. I drew inspiration from the existing implementation of the C Data interface and the JNI Dataset library in order to build this but I'm looking for any feedback at all.
   
   @pitrou This is what I was talking about on the dev mailing list the other day.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#issuecomment-905106440


   https://issues.apache.org/jira/browse/ARROW-13742


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeroshade closed pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
zeroshade closed pull request #10995:
URL: https://github.com/apache/arrow/pull/10995


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeroshade commented on a change in pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
zeroshade commented on a change in pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#discussion_r700471033



##########
File path: cpp/src/arrow/dataset/c/api.h
##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <stdint.h>
+
+#include "arrow/c/abi.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// For passing in as file format IDs and used to choose the correct file
+// format class
+#define DS_PARQUET_FORMAT 0
+#define DS_CSV_FORMAT 1
+#define DS_IPC_FORMAT 2
+
+// propagate the special value for inspecting all fragments instead of the default
+// that only inspects 1 fragment to get the schema
+extern const int kInspectAllFragments;
+#define DEFAULT_NUM_FRAGMENTS 1
+#define DISABLE_INSPECT_FRAGMENTS 0
+
+// Analagous to arrow::dataset::Scanner
+struct Scanner {

Review comment:
       That's a fair assessment. In particular since doing this, i've decided to try a different approach and see how it goes, see #11037 where I've implemented consumer functions for the C Data interface for go. I'm looking into trying to just do a minimal binding to the Compute API directly instead and the building on top of that. If that ends up being viable, I'll close out this PR in favor of that approach.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#discussion_r700369739



##########
File path: cpp/src/arrow/dataset/c/api.h
##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <stdint.h>
+
+#include "arrow/c/abi.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// For passing in as file format IDs and used to choose the correct file
+// format class
+#define DS_PARQUET_FORMAT 0
+#define DS_CSV_FORMAT 1
+#define DS_IPC_FORMAT 2
+
+// propagate the special value for inspecting all fragments instead of the default
+// that only inspects 1 fragment to get the schema
+extern const int kInspectAllFragments;
+#define DEFAULT_NUM_FRAGMENTS 1
+#define DISABLE_INSPECT_FRAGMENTS 0
+
+// Analagous to arrow::dataset::Scanner
+struct Scanner {

Review comment:
       I agree with @pitrou - the binding should be as minimal as possible since the classes of the dataset and compute APIs are experimental and changed frequently. In particular, we've discussed removing Scanner altogether.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#issuecomment-908325984


   Some high-level comments:
   1) since C has no namespacing, all names should be prefix by "Arrow" (for regular names) or "ARROW_" (for preprocessor macros)
   2) the proposed API is taking the C data interface as an inspiration. However, unless the intent is to allow different producers to provide the dataset API, this could be a more classical (idiomatic) C API.
   3) the dataset can be represented using the (experimental) C stream interface: https://arrow.apache.org/docs/format/CStreamInterface.html
   
   In the end, the C API might look like this:
   ```c
   
   #include "arrow/c/abi.h"
   
   struct ArrowDatasetFactory;
   
   enum ArrowDatasetFormat {
       ARROW_DATASET_PARQUET, ARROW_DATASET_CSV, ARROW_DATASET_IPC
   };
   
   int ArrowDatasetFactoryFromUri(const char* uri,
                                  struct ArrowDatasetFormat format,
                                  struct ArrowDatasetFactory** out);
   
   int ArrowDatasetFactoryInspect(struct ArrowDatasetFactory* factory,
                                  int num_fragments_to_inspect,
                                  struct ArrowSchema* out);
   
   int ArrowDatasetFactoryCreateDataset(struct ArrowDatasetFactory* factory,
                                        struct ArrowSchema* optional_schema,
                                        struct ArrowArrayStream* out);
   
   const char* ArrowDatasetFactoryGetLastError();
   
   void ArrowDatasetFactoryDestroy(struct ArrowDatasetFactory*);
   
   ```
   
   cc @bkietz for advice


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#issuecomment-908325984


   Some high-level comments:
   1) since C has no namespacing, all names should be prefix by "Arrow" (for regular names) or "ARROW_" (for preprocessor macros)
   2) the proposed API is taking the C data interface as an inspiration. However, unless the intent is to allow different producers to provide the dataset API, this could be a more classical (idiomatic) C API.
   3) the dataset can be represented using the (experimental) C stream interface: https://arrow.apache.org/docs/format/CStreamInterface.html
   
   In the end, the C API might look like this:
   ```c
   
   #include "arrow/c/abi.h"
   
   struct ArrowDatasetFactory;
   
   enum ArrowDatasetFormat {
       ARROW_DATASET_PARQUET, ARROW_DATASET_CSV, ARROW_DATASET_IPC
   };
   
   int ArrowDatasetFactoryFromUri(const char* uri,
                                  struct ArrowDatasetFormat format,
                                  struct ArrowDatasetFactory** out);
   
   int ArrowDatasetFactoryInspect(struct ArrowDatasetFactory* factory,
                                  int num_fragments_to_inspect,
                                  struct ArrowSchema* out);
   
   int ArrowDatasetFactoryCreateDataset(struct ArrowDatasetFactory* factory,
                                        struct ArrowSchema* optional_schema,
                                        struct ArrowArrayStream* out);
   
   const char* ArrowDatasetFactoryGetLastError();
   
   void ArrowDatasetFactoryDestroy(struct ArrowDatasetFactory*);
   
   ```
   
   cc @bkietz for advice


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeroshade commented on pull request #10995: ARROW-13742: [C++][Go] Expose Dataset API to Go via CGO

Posted by GitBox <gi...@apache.org>.
zeroshade commented on pull request #10995:
URL: https://github.com/apache/arrow/pull/10995#issuecomment-916354879


   Closing this in favor of #11037 and continuing that approach which ensures I'll be able to keep the bindings as minimal as possible and only bind directly to the compute api as minimally as I can rather than the entire dataset api.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org