You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/26 05:43:25 UTC

[GitHub] [arrow-datafusion] kou opened a new pull request, #2622: Add minimum C API

kou opened a new pull request, #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1113.
   
   # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   See #1113.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   This exports minimum C API to write the following Rust code in C:
   
       use datafusion::prelude::*;
   
       #[tokio::main]
       async fn main() -> datafusion::error::Result<()> {
         // register the table
         let mut ctx = ExecutionContext::new();
   
         // create a plan to run a SQL query
         let df = ctx.sql("SELECT 1").await?;
   
         // execute and print results
         df.show().await?;
         Ok(())
       }
   
   See datafusion/c/examples/sql.c for C version. You can build and run
   datafusion/c/examples/sql.c by the following command lines:
   
       $ cargo build
       $ cc -o target/debug/sql datafusion/c/examples/sql.c -Idatafusion/c/include -Ltarget/debug -Wl,--rpath=target/debug -ldatafusion_c
       $ target/debug/sql
       +----------+
       | Int64(1) |
       +----------+
       | 1        |
       +----------+
   
   This implementation doesn't export Future like
   datafusion-python. Async functions are block_on()-ed in exported
   API. But I think that we can export Future in follow-up tasks.
   
   Follow-up tasks:
   
     * Add support for testing by "cargo test"
     * Add support for building and running examples by "cargo ..."
     * Add support for installing datafusion.h
     * Add documentation
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   Users can use DataFusion from C and/or FFI.
   
   # Does this PR break compatibility with Ballista?
   
   <!--
   The CI checks will attempt to build [arrow-ballista](https://github.com/apache/arrow-ballista) against this PR. If 
   this check fails then it indicates that this PR makes a breaking change to the DataFusion API.
   
   If possible, try to make the change in a way that is not a breaking API change. For example, if code has moved 
    around, try adding `pub use` from the original location to preserve the current API.
   
   If it is not possible to avoid a breaking change (such as when adding enum variants) then follow this process:
   
   - Make a corresponding PR against `arrow-ballista` with the changes required there
   - Update `dev/build-arrow-ballista.sh` to clone the appropriate `arrow-ballista` repo & branch
   - Merge this PR when CI passes
   - Merge the Ballista PR
   - Create a new PR here to reset `dev/build-arrow-ballista.sh` to point to `arrow-ballista` master again
   
   _If you would like to help improve this process, please see https://github.com/apache/arrow-datafusion/issues/2583_
   -->
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] kou closed pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
kou closed pull request #2622: Export minimum C API and examples for C, Ruby and Python
URL: https://github.com/apache/arrow-datafusion/pull/2622


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Jimexist commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
Jimexist commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1141311167

   I agree with either creating a separate repo or merge as is but thanks for the good work now that I guess Java binding can use this c interface and possibly with the new jextract tool as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1142326648

   Moving this to draft to avoid accidental merge


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #2622: Export minimum C API

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1139890869

   Hi @kou and thanks for the contribution! 
   
   This looks really interesting but I would like to understand more about the motivation and context for this. Making DataFusion accessible from C makes sense but I am wondering if we should create a separate repository for this in https://github.com/datafusion-contrib/. This is where we have the Java and Python bindings for DataFusion. I'm also curious why we would want to go from Python -> C -> Rust rather than just Python -> Rust directly as we do in https://github.com/datafusion-contrib/datafusion-python
   
   I am also concerned that adding C code as part of the default build may be problematic for some users. I assume there are some minimum requirements for having this work on all platforms?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] kou commented on pull request #2622: Export minimum C API

Posted by GitBox <gi...@apache.org>.
kou commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1139402136

   I've added examples that use the C API from Python and Ruby with FFI library.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1142045287

   > Introducing sync APIs - Rust's async is a barrier to integrating with other languages. A subset of features, like querying data that is entirely in-memory, would be supported without requiring async or sync-over-async. This would mean avoiding async where possible, due to the virality of async APIs.
   
   Yes I think this is the most reasonable solution suggestion -- don't expose any async APIs and have DataFusion do its thread pool / IO management internally. If people want the additional performance or resource control they could use the Rust APIs directly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1141276360

   Thanks @kou. I created a new repo https://github.com/datafusion-contrib/datafusion-c where you can PR this work. Thanks again for adding another language binding!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] kou commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
kou commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1145712607

   I close this in favor of https://github.com/datafusion-contrib/datafusion-c/pull/1 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] kou commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
kou commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1140777959

   Regarding repository, I'm OK with it. Could you create https://github.com/datafusion-contrib/datafusion-c or something? Or should I create my repository for this?
   
   > I'm also curious why we would want to go from Python -> C -> Rust rather than just Python -> Rust directly as we do in https://github.com/datafusion-contrib/datafusion-python
   
   Sorry for confusing you. I didn't want to suggest that we should use this for Python bindings. I just wanted to show that we can use FFI library for bindings as an use case of this C API. I should have used Julia or something rather than Python.
   
   > * It would be nice to put a minimal readme in `datafusion/c` (saying, for example, that the directory contains the C API, see `examples/README.md` for more details). Or maybe we could move `datafusion/c/examples/README.md` to `datafusion/c/README.md` to make it more discoverable
   > 
   > * I think it would be a good idea to track planned follow on items as individual tasks
   
   They make sense. I'll do them in new repository for this.
   
   > I am not familiar how to interface Rust `async` functions with C -- I would assume it looks like callbacks somehow, but I can imagine how it gets very tricky very quickly
   
   I will not use callbacks like the following:
   
   ```c
   int
   main(void)
   {
     DFSessionContext *context = df_session_context_new();
     DFError *error = NULL;
   
     DFDataFrameFuture *sql_future =
       df_session_context_sql_async(context, "SELECT 1;", &error);
     if (error) {
       printf("failed to start SQL: %s\n", df_error_get_message(error));
       df_error_free(error);
       df_session_context_free(context);
       return EXIT_FAILURE;
     }
   
     DFDataFrame *data_frame = df_data_frame_future_await(sql_future, &error);
     if (error) {
       printf("failed to run SQL: %s\n", df_error_get_message(error));
       df_error_free(error);
       df_session_context_free(context);
       return EXIT_FAILURE;
     }
   
     DFFuture *show_future = df_data_frame_show_async(data_frame, &error);
     if (error) {
       printf("failed to start showing data frame: %s\n",
              df_error_get_message(error));
       df_error_free(error);
       df_data_frame_free(data_frame);
       return EXIT_FAILURE;
     }
   
     df_future_await(show_future, &error);
     if (error) {
       printf("failed to show data frame: %s\n",
              df_error_get_message(error));
       df_error_free(error);
       df_data_frame_free(data_frame);
       return EXIT_FAILURE;
     }
   
     df_data_frame_free(data_frame);
     df_session_context_free(context);
     return EXIT_SUCCESS;
   }
   ```
   
   But I may change my mind.
   
   
   Thanks for suggestions for my Rust code! This is my first Rust program. So suggestions are very welcome. :-)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#discussion_r884115250


##########
datafusion/c/src/lib.rs:
##########
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::boxed::Box;
+use std::ffi::CStr;
+use std::ffi::CString;
+use std::future::Future;
+use std::sync::Arc;
+
+use datafusion::dataframe::DataFrame;
+use datafusion::execution::context::SessionContext;
+
+#[repr(C)]
+pub struct DFError {
+    code: u32,
+    message: *mut libc::c_char,
+}
+
+impl DFError {
+    pub fn new(code: u32, message: *mut libc::c_char) -> Self {
+        Self { code, message }
+    }
+}
+
+#[no_mangle]
+#[allow(clippy::not_unsafe_ptr_arg_deref)]
+pub extern "C" fn df_error_new(code: u32, message: *const libc::c_char) -> *mut DFError {
+    let error = DFError::new(code, unsafe { libc::strdup(message) });
+    Box::into_raw(Box::new(error))
+}
+
+/// # Safety
+///
+/// This function should not be called with `error` that is not
+/// created by `df_errro_new()`.
+///
+/// This function should not be called for the same `error` multiple
+/// times.
+#[no_mangle]
+pub unsafe extern "C" fn df_error_free(error: *mut DFError) {
+    libc::free((*error).message as *mut libc::c_void);
+    Box::from_raw(error);
+}
+
+/// # Safety
+///
+/// This function should not be called with `error` that is not
+/// created by `df_errro_new()`.
+///
+/// This function should not be called with `error` that is freed by
+/// `df_error_free()`.
+#[no_mangle]
+pub unsafe extern "C" fn df_error_get_message(
+    error: *mut DFError,
+) -> *const libc::c_char {
+    (*error).message
+}
+
+trait IntoDFError {
+    type Value;
+    fn into_df_error(
+        self,
+        error: *mut *mut DFError,
+        error_value: Option<Self::Value>,
+    ) -> Option<Self::Value>;
+}
+
+impl<V, E: std::fmt::Display> IntoDFError for Result<V, E> {
+    type Value = V;
+    fn into_df_error(
+        self,
+        error: *mut *mut DFError,
+        error_value: Option<Self::Value>,
+    ) -> Option<Self::Value> {
+        match self {
+            Ok(value) => Some(value),
+            Err(e) => {
+                if !error.is_null() {
+                    let c_string_message = match CString::new(format!("{}", e)) {
+                        Ok(c_string_message) => c_string_message,
+                        Err(_) => return error_value,
+                    };
+                    unsafe {
+                        *error = df_error_new(1, c_string_message.as_ptr());
+                    };
+                }
+                error_value
+            }
+        }
+    }
+}
+
+fn block_on<F: Future>(future: F) -> F::Output {
+    tokio::runtime::Runtime::new().unwrap().block_on(future)

Review Comment:
   ```suggestion
       tokio::runtime::Runtime::new().expect("Can not create tokio runtime").block_on(future)
   ```



##########
datafusion/c/src/lib.rs:
##########
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::boxed::Box;
+use std::ffi::CStr;
+use std::ffi::CString;
+use std::future::Future;
+use std::sync::Arc;
+
+use datafusion::dataframe::DataFrame;
+use datafusion::execution::context::SessionContext;
+
+#[repr(C)]
+pub struct DFError {
+    code: u32,
+    message: *mut libc::c_char,
+}
+
+impl DFError {
+    pub fn new(code: u32, message: *mut libc::c_char) -> Self {
+        Self { code, message }
+    }
+}
+
+#[no_mangle]
+#[allow(clippy::not_unsafe_ptr_arg_deref)]
+pub extern "C" fn df_error_new(code: u32, message: *const libc::c_char) -> *mut DFError {
+    let error = DFError::new(code, unsafe { libc::strdup(message) });
+    Box::into_raw(Box::new(error))
+}
+
+/// # Safety
+///
+/// This function should not be called with `error` that is not
+/// created by `df_errro_new()`.
+///
+/// This function should not be called for the same `error` multiple
+/// times.
+#[no_mangle]
+pub unsafe extern "C" fn df_error_free(error: *mut DFError) {
+    libc::free((*error).message as *mut libc::c_void);
+    Box::from_raw(error);
+}
+
+/// # Safety
+///
+/// This function should not be called with `error` that is not
+/// created by `df_errro_new()`.
+///
+/// This function should not be called with `error` that is freed by
+/// `df_error_free()`.
+#[no_mangle]
+pub unsafe extern "C" fn df_error_get_message(
+    error: *mut DFError,
+) -> *const libc::c_char {
+    (*error).message
+}
+
+trait IntoDFError {
+    type Value;
+    fn into_df_error(
+        self,
+        error: *mut *mut DFError,
+        error_value: Option<Self::Value>,
+    ) -> Option<Self::Value>;
+}
+
+impl<V, E: std::fmt::Display> IntoDFError for Result<V, E> {
+    type Value = V;
+    fn into_df_error(
+        self,
+        error: *mut *mut DFError,
+        error_value: Option<Self::Value>,
+    ) -> Option<Self::Value> {
+        match self {
+            Ok(value) => Some(value),
+            Err(e) => {
+                if !error.is_null() {
+                    let c_string_message = match CString::new(format!("{}", e)) {

Review Comment:
   ```suggestion
                       let c_string_message = match CString::new(e.to_string()) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] loic-sharma commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
loic-sharma commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1141474057

   Hello, I've been watching DataFusion from the sidelines and am interested in using it in a Zig project. As a disclaimer, I'm new to Rust and am not that familiar with its async implementation.
   
   DataFusion's "query engine as a library" is similar to SQLite's "database as a library". SQLite's success is in part due to how easy it is to integrate it into *any* project, regardless of the language it uses. It'd be wonderful if DataFusion was also as easy to use in all projects!
   
   Today, a key challenge is Rust's async. While using blocking tricks is a great short-term solution, it is inefficient (it requires multiple threads) and can cause deadlock issues. Callbacks are not a perfect solution either (what if my language runtime requires stack unwinding?).
   
   In my opinion this would be helped by:
   
   1. **An excellent C API** - Most languages have tooling to integrate with C, so, a C API makes it easier to use DataFusion in other languages. While in the short-term it's perfectly fine for the C API to be in a separate repository, in the long-term I'd recommend having the C API inside the arrow-datafusion repository. This would be force new DataFusion API designs to take the C API into consideration as well.
   2. **Minimize async APIs** - Rust's async is a barrier to integrating with other languages. Ideally DataFusion should be able to query in-memory data using the caller's thread - similar to SQLite's threading model - without requiring async. DataFusion's APIs should be designed to avoid async unless absolutely necessary. Care should be taken to minimize the virality of async APIs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] houqp commented on pull request #2622: Export minimum C API and examples for C, Ruby and Python

Posted by GitBox <gi...@apache.org>.
houqp commented on PR #2622:
URL: https://github.com/apache/arrow-datafusion/pull/2622#issuecomment-1141526122

   Very cool demo @kou ! Agree that it would be better to manage it through the datafusion-c repo in the contrib github org.
   
   With regards to async, i think we can use tokio runtime block on to hide all the async apis behind a set of sync apis similar to what we do in our python binding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org