Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/08 05:23:12 UTC

[GitHub] [arrow] multimeric opened a new issue #12102: How to pass an in-memory arrow object from Rust into R

multimeric opened a new issue #12102:
URL: https://github.com/apache/arrow/issues/12102


   Hi. In the [extendr](https://github.com/extendr/extendr/) project (R bindings for Rust) we're looking into how to integrate [Polars](https://github.com/pola-rs/polars) (a prominent Rust Data Frame library that uses Arrow arrays internally) with R. Since Polars already stores arrays in the arrow format, I was thinking that it should be possible to just [return a pointer from Rust directly to R](https://pola-rs.github.io/polars/polars/series/struct.Series.html#method.to_arrow), and then maybe set the class attribute, and then R's `arrow` should provide implementations of all the standard R generic functions a user might want. Does this approach make sense? Are there any other tricks I should know, for example attributes that R's `arrow` expects to be set, or perhaps you use a wrapper struct around the arrow array that we would have to implement?
   
   We're also interested in the reverse integration, namely passing data created in R back into Polars, but I might ask about that later on.
   
   Some discussion about this integration can be found here: https://github.com/extendr/extendr/issues/331.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] multimeric commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008448831


   Thanks all, the help is greatly appreciated.
   
   I'll try the FFI interface; it seems to be what I want. It looks like the workflow will involve `let ptr = export_array_to_c(some_series.to_arrow());` in Rust, and then `ImportArray` in R. I assume this will set all the appropriate metadata for me.
   
   Oddly neither `Array$import_from_c` nor [`ImportArray`](https://github.com/apache/arrow/blob/cfcce5a243981deb21b47c358059ea85488fee86/r/R/arrowExports.R#L227-L229) seems to be included in the R API docs, even though they're exported functions. I wonder if that could be added in a later release?





[GitHub] [arrow] jorgecarleitao edited a comment on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008330881


   Hi. Thanks for the ping @nealrichardson .
   
   Thanks for the initiative, @multimeric , super cool!
   
   Note that the C data interface is designed for _intra_-process communication: R would be running in the same process as Polars.
   
   Polars uses an unofficial Rust implementation of Arrow, so we have to use its API here. Say you have a Polars DataFrame in Rust. You can extract any of its series via the index operator `[]`. A series is just a vector of Arrow arrays, which you get via [`.chunks`](https://docs.rs/polars/latest/polars/series/trait.SeriesTrait.html#method.chunks). At this point we can disregard Polars and just focus on Arrow. To export each of the arrays, you need 3 steps:
   
   1. allocate two empty ffi interfaces (two Rust Boxes with the ffi-compatible structs)
   2. write the array and field to them
   3. call the corresponding function to import the two from R
   
   * [Step 1](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L75)
   * [Step 2](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L81)
   * [Step 3 (in Python)](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L89)
   
   I am not very familiar with R, but I think that Step 3 amounts to calling `Array$import_from_c` from R. Note that all of these steps are `O(1)` and thus incur no performance cost (a core idea of the Arrow format).
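
The three steps above can be sketched with the standard library alone, assuming only the two struct layouts published in the C Data Interface format docs. This is not arrow2's API: `Owned`, `allocate`, `export_i32`, `import_i32` and `roundtrip` are hypothetical names, and `import_i32` is only a stand-in for a real consumer such as R's `Array$import_from_c`.

```rust
use std::ffi::{c_void, CStr};
use std::os::raw::c_char;
use std::ptr;

/// Field layout taken from the Arrow C Data Interface format docs.
#[repr(C)]
pub struct ArrowSchema {
    pub format: *const c_char,
    pub name: *const c_char,
    pub metadata: *const c_char,
    pub flags: i64,
    pub n_children: i64,
    pub children: *mut *mut ArrowSchema,
    pub dictionary: *mut ArrowSchema,
    pub release: Option<unsafe extern "C" fn(*mut ArrowSchema)>,
    pub private_data: *mut c_void,
}

#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical owner of the exported memory; dropped by `release_array`.
struct Owned {
    values: Vec<i32>,
    buffers: [*const c_void; 2], // [validity bitmap, data]
}

unsafe extern "C" fn release_schema(schema: *mut ArrowSchema) {
    // `format` and `name` are static strings here, so only mark as released.
    unsafe { (*schema).release = None };
}

unsafe extern "C" fn release_array(array: *mut ArrowArray) {
    unsafe {
        drop(Box::from_raw((*array).private_data as *mut Owned));
        (*array).release = None;
    }
}

/// Step 1: allocate two empty (released) structs on the heap.
pub fn allocate() -> (Box<ArrowArray>, Box<ArrowSchema>) {
    // All-zero bytes are valid here: null pointers, zero counts, `release: None`.
    unsafe { (Box::new(std::mem::zeroed()), Box::new(std::mem::zeroed())) }
}

/// Step 2: write a non-nullable int32 array and its schema into the structs.
pub fn export_i32(values: Vec<i32>, array: &mut ArrowArray, schema: &mut ArrowSchema) {
    schema.format = b"i\0".as_ptr() as *const c_char; // "i" = int32
    schema.name = b"\0".as_ptr() as *const c_char;
    schema.release = Some(release_schema);

    let mut owned = Box::new(Owned { values, buffers: [ptr::null(); 2] });
    owned.buffers[1] = owned.values.as_ptr() as *const c_void;
    array.length = owned.values.len() as i64;
    array.n_buffers = 2;
    array.buffers = owned.buffers.as_mut_ptr();
    array.release = Some(release_array);
    array.private_data = Box::into_raw(owned) as *mut c_void;
}

/// Step 3 stand-in: check the format, read the data buffer, then release both structs.
pub unsafe fn import_i32(array: *mut ArrowArray, schema: *mut ArrowSchema) -> Vec<i32> {
    unsafe {
        assert_eq!(CStr::from_ptr((*schema).format).to_bytes(), &b"i"[..]);
        let data = *(*array).buffers.add(1) as *const i32;
        let out = std::slice::from_raw_parts(data, (*array).length as usize).to_vec();
        // The consumer signals it is done by calling each release callback once.
        if let Some(release) = (*schema).release { release(schema); }
        if let Some(release) = (*array).release { release(array); }
        out
    }
}

/// Runs all three steps in one process; the raw pointers are what would cross the boundary.
pub fn roundtrip(values: Vec<i32>) -> Vec<i32> {
    let (mut array, mut schema) = allocate();
    export_i32(values, &mut array, &mut schema);
    let (array_ptr, schema_ptr) = (Box::into_raw(array), Box::into_raw(schema));
    unsafe {
        let out = import_i32(array_ptr, schema_ptr);
        // The producer still owns the two struct allocations themselves.
        drop(Box::from_raw(array_ptr));
        drop(Box::from_raw(schema_ptr));
        out
    }
}
```

The stand-in consumer copies the values out only so that the result can be inspected; a real importer would wrap the buffers without copying and call `release` once it no longer needs them, which is what keeps the hand-off `O(1)`.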





[GitHub] [arrow] alamb commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008323791


   FWIW I think `polars` uses `arrow2` which also has an `ffi` module, but the interface is different: https://docs.rs/arrow2/0.8.1/arrow2/ffi/index.html





[GitHub] [arrow] multimeric commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1011061466


   And the FFI doesn't support `RecordBatch`es? So there's no way to pass entire data frames from one process to another? I suppose in that case it is necessary to reconstruct data frames in the destination language, using the Array pointers.





[GitHub] [arrow] nealrichardson commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008095193


   I think you want to use Arrow's C data interface. It's how we pass data between Python and R, to/from DuckDB in both Python and R, and how the Rust DataFusion project works with Python as well. 
   
   Some references:
   
   * reticulate methods in the arrow R package: https://github.com/apache/arrow/blob/master/r/R/python.R
   * implementation in arrow-rs: https://github.com/apache/arrow-rs/blob/master/arrow/src/ffi.rs
   * DataFusion's python library: https://github.com/datafusion-contrib/datafusion-python
   * Format docs: https://arrow.apache.org/docs/format/CDataInterface.html





[GitHub] [arrow] ritchie46 commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008334083


   If you are also using polars/arrow logical types (`Categorical`, `Datetime`, `Date`, `Duration`, or `Time`) you must ensure that the internal chunks of `Series` are coerced to their logical arrow type. This is done by calling [Series::to_arrow()](https://github.com/pola-rs/polars/blob/8f9010cf51e94b188002f7573de59ca94601247a/polars/polars-core/src/series/into.rs#L14).
   
   From there it's indeed arrow FFI just like @jorgecarleitao said.
   
   If you need some inspiration here are the relevant functions for FFI with python we use in polars: https://github.com/pola-rs/polars/tree/master/py-polars/src/arrow_interop
   
   Good luck!
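
To illustrate why the coercion to logical types matters, the sketch below compares the C Data Interface format string a consumer would see for the physical integers with the one it would see for the logical Arrow type. The format strings come from the C Data Interface format docs. The `Logical` and `Unit` enums are hypothetical stand-ins, not Polars types, and the physical layouts (32-bit days for `Date`, 64-bit integers for the rest, nanoseconds for `Time`) are assumptions about Polars internals. `Categorical` is left out because it is dictionary-encoded and needs a separate dictionary schema.

```rust
/// Time units with their one-letter codes in C Data Interface format strings.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Unit {
    Milli,
    Micro,
    Nano,
}

impl Unit {
    fn code(self) -> char {
        match self {
            Unit::Milli => 'm',
            Unit::Micro => 'u',
            Unit::Nano => 'n',
        }
    }
}

/// Hypothetical stand-in for the logical types named above (not the Polars enum).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Logical {
    Date,
    Datetime(Unit),
    Duration(Unit),
    Time,
}

/// Format string seen when only the physical integers are exported (assumed layouts).
pub fn physical_format(dtype: Logical) -> &'static str {
    match dtype {
        Logical::Date => "i", // int32
        _ => "l",             // int64
    }
}

/// Format string seen after coercion to the logical Arrow type.
pub fn logical_format(dtype: Logical) -> String {
    match dtype {
        Logical::Date => "tdD".to_string(),                 // date32 [days]
        Logical::Datetime(u) => format!("ts{}:", u.code()), // timestamp, no timezone
        Logical::Duration(u) => format!("tD{}", u.code()),  // duration
        Logical::Time => "ttn".to_string(),                 // time64 [nanoseconds]
    }
}
```

Under these assumptions, a `Datetime` column in microseconds exported without coercion would be described as `l`, so the importing side would see a plain int64 array rather than a timestamp (`tsu:`).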





[GitHub] [arrow] nealrichardson commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1011387538


   RecordBatch is a concept in the [Arrow IPC format](https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message), but Jorge is right, it is not in the C interface. But a RecordBatch and a StructArray are nearly identical, and that's how the C++ implementation of the C interface deals with them, for example. See https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L648
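
As a rough illustration of the RecordBatch-as-StructArray idea, the sketch below describes a two-column data frame as a single struct-typed schema (format `+s`) with one child schema per column. It uses only the standard library and the `ArrowSchema` layout from the format docs. `SchemaPrivate`, `new_schema`, `record_batch_schema` and `describe` are hypothetical helpers, not functions from arrow-rs, arrow2 or the C++ bridge, and only the schema half is shown; the array half would mirror it with one child `ArrowArray` per column.

```rust
use std::ffi::{c_void, CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Field layout taken from the Arrow C Data Interface format docs.
#[repr(C)]
pub struct ArrowSchema {
    pub format: *const c_char,
    pub name: *const c_char,
    pub metadata: *const c_char,
    pub flags: i64,
    pub n_children: i64,
    pub children: *mut *mut ArrowSchema,
    pub dictionary: *mut ArrowSchema,
    pub release: Option<unsafe extern "C" fn(*mut ArrowSchema)>,
    pub private_data: *mut c_void,
}

/// Hypothetical owner of everything one schema points at.
struct SchemaPrivate {
    format: CString,
    name: CString,
    children: Vec<*mut ArrowSchema>,
}

unsafe extern "C" fn release_schema(schema: *mut ArrowSchema) {
    unsafe {
        let private = Box::from_raw((*schema).private_data as *mut SchemaPrivate);
        for &child in &private.children {
            // A child whose `release` is already None was moved away by the consumer.
            if let Some(release) = (*child).release {
                release(child);
            }
            drop(Box::from_raw(child));
        }
        (*schema).release = None;
    }
}

fn new_schema(format: &str, name: &str, children: Vec<*mut ArrowSchema>) -> ArrowSchema {
    let mut private = Box::new(SchemaPrivate {
        format: CString::new(format).unwrap(),
        name: CString::new(name).unwrap(),
        children,
    });
    ArrowSchema {
        format: private.format.as_ptr(),
        name: private.name.as_ptr(),
        metadata: ptr::null(),
        flags: 0,
        n_children: private.children.len() as i64,
        children: private.children.as_mut_ptr(),
        dictionary: ptr::null_mut(),
        release: Some(release_schema),
        private_data: Box::into_raw(private) as *mut c_void,
    }
}

/// One struct-typed schema ("+s") whose children are the columns, given as (name, format).
pub fn record_batch_schema(columns: &[(&str, &str)]) -> ArrowSchema {
    let children: Vec<*mut ArrowSchema> = columns
        .iter()
        .map(|(name, format)| Box::into_raw(Box::new(new_schema(format, name, Vec::new()))))
        .collect();
    new_schema("+s", "", children)
}

/// Stand-in consumer: returns the parent format, then "name:format" per child, then releases.
pub fn describe(mut schema: ArrowSchema) -> Vec<String> {
    let mut out = Vec::new();
    unsafe {
        out.push(CStr::from_ptr(schema.format).to_string_lossy().into_owned());
        for i in 0..schema.n_children as usize {
            let child = *schema.children.add(i);
            let name = CStr::from_ptr((*child).name).to_string_lossy();
            let format = CStr::from_ptr((*child).format).to_string_lossy();
            out.push(format!("{}:{}", name, format));
        }
        if let Some(release) = schema.release {
            release(&mut schema);
        }
    }
    out
}
```

With this shape, a whole data frame crosses the boundary as one array/schema pair, and the importing side can choose to view the struct's children as the columns of a record batch.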





[GitHub] [arrow] multimeric closed issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric closed issue #12102:
URL: https://github.com/apache/arrow/issues/12102


   





[GitHub] [arrow] multimeric commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008205268


   So once I `array.to_raw()` in Rust, I will have a pointer to an arrow object in the C format. How then can I pass it to the R `arrow`? I assume it needs some metadata?





[GitHub] [arrow] alamb commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008323398


   I think what you are looking for is called the `ffi` module in arrow-rs: https://docs.rs/arrow/6.5.0/arrow/ffi/index.html
   
   Perhaps something like
   ```rust
   let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
   ```
   
   





[GitHub] [arrow] jorgecarleitao edited a comment on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008330881


   Hi. Thanks for the ping @nealrichardson .
   
   Thanks for the initiative, @multimeric , super cool!
   
   Note that the C data interface is designed for _intra_-process communication: R would be running in the same process as Polars.
   
   Polars uses an unofficial Rust implementation of Arrow, so we have to use its API here. Say you have a Polars DataFrame in Rust. You can extract any of its series via the index operator `[]`. A series is just a vector of Arrow arrays, which you get via [`.chunks`](https://docs.rs/polars/latest/polars/series/trait.SeriesTrait.html#method.chunks). At this point we can disregard Polars and just focus on Arrow. To export each of the arrays, you need 3 steps:
   
   1. allocate two empty ffi interfaces (two Rust Boxes with the ffi-compatible structs)
   2. write the array and schema to them
   3. call the corresponding function to import the two from R
   
   * [Step 1](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L75)
   * [Step 2](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L81)
   * [Step 3 (in Python)](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L89)
   
   I am not very familiar with R, but I think that Step 3 amounts to calling `Array$import_from_c` from R. Note that all of these steps are `O(1)` and thus incur no performance cost (a core idea of the Arrow format).





[GitHub] [arrow] alamb edited a comment on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008323398


   I think what you are looking for is called the `ffi` module in arrow-rs: https://docs.rs/arrow/6.5.0/arrow/ffi/index.html
   
   Perhaps something like
   ```rust
   let (array_ptr, schema_ptr) = array.to_raw()?;
   ```
   
   Then you can pass the array_ptr and schema_ptr to the R implementation (which I am not familiar with).





[GitHub] [arrow] multimeric commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008204725


   But why do I need an intermediate API at all? If I have a block of memory in the right format, can I not pass it to any library used by the same process?





[GitHub] [arrow] jorgecarleitao commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1011065600


   A recordbatch is not part of the arrow spec (and in particular the c data interface), it is something done ad-hoc by implementations.





[GitHub] [arrow] multimeric commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
multimeric commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008204949


   Oh I see, the Rust version uses Rust native data types so it's not automatically compatible.





[GitHub] [arrow] jorgecarleitao commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008330881


   Hi. Thanks for the ping @nealrichardson .
   
   Thanks for the initiative, @multimeric , super cool!
   
   Note that the C data interface is designed for _intra_-process communication: R would be running in the same process as Polars.
   
   Polars uses an unofficial Rust implementation of Arrow, so we have to use its API here. Say you have a Polars DataFrame in Rust. You can extract any of its series via the index operator `[]`. A series is just a vector of Arrow arrays, which you get via [`.chunks`](https://docs.rs/polars/latest/polars/series/trait.SeriesTrait.html#method.chunks). At this point we can disregard Polars and just focus on Arrow. To export each of the arrays, you need 3 steps:
   
   1. allocate two empty ffi interfaces (two Rust Boxes with the ffi-compatible structs)
   2. write the array and schema to them
   3. call the corresponding function to import the two from R
   
   * [Step 1](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L75)
   * [Step 2](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L81)
   * [Step 3 (in Python)](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L89)
   
   I am not very familiar with R, but I think that Step 3 amounts to calling `Array$import_from_c` from R. Note that all of these steps are `O(1)` and thus incur no performance cost (a core idea of the Arrow format).





[GitHub] [arrow] nealrichardson commented on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008322593


   The C Data Interface is the way you pass a block of memory in process. I don't know the specifics of the Rust implementation to guide you further, maybe others can help there. The best I can suggest is that you triangulate from the links to the code I shared. 
   
   cc @kszucs @jorgecarleitao @alamb 





[GitHub] [arrow] jorgecarleitao edited a comment on issue #12102: How to pass an in-memory arrow object from Rust into R

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on issue #12102:
URL: https://github.com/apache/arrow/issues/12102#issuecomment-1008330881


   Hi. Thanks for the ping @nealrichardson .
   
   Thanks for the initiative, @multimeric , super cool!
   
   Note that the C data interface is designed for _intra_-process communication: R would be running in the same process as Polars.
   
   Polars uses an unofficial Rust implementation of Arrow, so we have to use its API here. Say you have a Polars DataFrame in Rust. You can extract any of its series via the index operator `[]`. A series is just a vector of Arrow arrays, which you get via [`.chunks`](https://docs.rs/polars/latest/polars/series/trait.SeriesTrait.html#method.chunks). At this point we can disregard Polars and just focus on Arrow. To export each of the arrays, you need 3 steps:
   
   1. allocate two empty ffi interfaces (two Rust Boxes with the ffi-compatible structs)
   2. write the array to them
   3. call the corresponding function to import the two from R
   
   * [Step 1](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L75)
   * [Step 2](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L81)
   * [Step 3 (in Python)](https://github.com/jorgecarleitao/arrow2/blob/main/arrow-pyarrow-integration-testing/src/lib.rs#L89)
   
   I am not very familiar with R, but I think that Step 3 amounts to calling `Array$import_from_c` from R. Note that all of these steps are `O(1)` and thus incur no performance cost (a core idea of the Arrow format).

