Posted to issues@arrow.apache.org by "test24242 (via GitHub)" <gi...@apache.org> on 2023/06/23 17:06:27 UTC

[GitHub] [arrow] test24242 opened a new issue, #36274: Expose a std::shared_ptr to R SEXP

test24242 opened a new issue, #36274:
URL: https://github.com/apache/arrow/issues/36274

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello,
   
   I'm trying to build bindings around a library that exposes `std::shared_ptr<arrow::Table>` to R scripts.
   
   Is there a way to access the function in arrow-r that converts an `arrow::Table` to an R SEXP?
   
   Regards
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaoZeke commented on issue #36274: Expose a std::shared_ptr to R SEXP

Posted by "HaoZeke (via GitHub)" <gi...@apache.org>.
HaoZeke commented on issue #36274:
URL: https://github.com/apache/arrow/issues/36274#issuecomment-1646714752

   Could you expand on the answer @paleolimbot? In my use case, for example, I have some C++ code which creates a `std::shared_ptr<arrow::Table>` object which can no longer be returned directly to `R` via `Rcpp::export` since the change to `cpp11`. What is the best way to have a zero copy interface in this situation?




[GitHub] [arrow] paleolimbot commented on issue #36274: Expose a std::shared_ptr to R SEXP

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #36274:
URL: https://github.com/apache/arrow/issues/36274#issuecomment-1666930992

   Sorry for the delay here...I was taking some time away from the keyboard.
   
   It seems like you are interested in the reverse problem. My earlier reprex demos taking an Arrow Table from R and doing a computation in C++ that doesn't return a Table. Below I've tweaked it a bit to illustrate the reverse process (i.e., if you have a Table in Arrow C++, how to communicate it back to the arrow R package to get a Table object):
   
   ``` r
   # These are specific to my system (homebrew on MacOS M1)
   arrow_include <- "-I/opt/homebrew/Cellar/apache-arrow/12.0.1/include"
   arrow_libs <- "-L/opt/homebrew/Cellar/apache-arrow/12.0.1/lib -larrow"
   Sys.setenv("PKG_CXXFLAGS" = arrow_include)
   Sys.setenv("PKG_LIBS" = arrow_libs)
   
   cpp11::cpp_source(code = '
   #include <arrow/table.h>
   #include <arrow/c/bridge.h>
   #include <cpp11.hpp>
   
   using namespace arrow;
   
   // Version that returns a Result<> so we can use Arrow C++-style error handling
   // macros
   Result<std::shared_ptr<Table>> array_stream_to_table(SEXP array_stream_xptr) {
     auto array_stream = reinterpret_cast<struct ArrowArrayStream*>(
       R_ExternalPtrAddr(array_stream_xptr));
     
     ARROW_ASSIGN_OR_RAISE(auto reader, ImportRecordBatchReader(array_stream))
     
     return reader->ToTable();
   }
   
   Status table_to_array_stream(const std::shared_ptr<Table>& table, SEXP array_stream_xptr) {
     auto reader = std::make_shared<arrow::TableBatchReader>(table);
     auto array_stream = reinterpret_cast<struct ArrowArrayStream*>(
       R_ExternalPtrAddr(array_stream_xptr));
     return ExportRecordBatchReader(reader, array_stream);
   }
   
   // Version that uses cpp11 error handling
   [[cpp11::register]]
   void slice_table(SEXP array_stream_xptr_in, int offset, int length, SEXP array_stream_xptr_out) {
   
     Result<std::shared_ptr<Table>> maybe_input = array_stream_to_table(array_stream_xptr_in);
     if (!maybe_input.ok()) {
       cpp11::stop("Arrow C++ error: %s", maybe_input.status().ToString().c_str());
     }
     
     std::shared_ptr<Table> input = *maybe_input;
     std::shared_ptr<Table> output = input->Slice(offset, length);
     
     Status status = table_to_array_stream(output, array_stream_xptr_out);
     if (!status.ok()) {
       cpp11::stop("Arrow C++ error: %s", status.ToString().c_str());
     }
   }
   
   ', cxx_std = "CXX17")
   
   
   library(arrow, warn.conflicts = FALSE)
   library(nanoarrow)
   
   # Prepare input
   tab <- arrow_table(x = 1:10)
   array_stream_in <- as_nanoarrow_array_stream(tab)
   array_stream_out <- nanoarrow_allocate_array_stream()
   
   # Call C++ function
   slice_table(array_stream_in, 2, 7, array_stream_out)
   
   # convert output to Table
   as_arrow_table(as_record_batch_reader(array_stream_out))
   #> Table
   #> 7 rows x 1 columns
   #> $x <int32>
   ```
   
   <sup>Created on 2023-08-06 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>




[GitHub] [arrow] paleolimbot commented on issue #36274: Expose a std::shared_ptr to R SEXP

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #36274:
URL: https://github.com/apache/arrow/issues/36274#issuecomment-1607431346

   The `std::shared_ptr<arrow::Table>` is a C++ pointer tied to a very specific version of Arrow C++ built with very specific compiler flags. Pointers like this are usually not exposed to other scripts or packages in R because it is difficult to guarantee stability. When you say "expose to R scripts"...do you mean that you have some Arrow C++ code linked to R using something like Rcpp?
   
   I think what you may be looking for is the C data interface. Arrow C++ can export a table as an ABI-stable stream of record batches. This is not *quite* the same as a table but will allow you to export the Table from the arrow R package and import it using C++ from elsewhere.
   
   ``` r
   # These are specific to my system (homebrew on MacOS M1)
   arrow_include <- "-I/opt/homebrew/Cellar/apache-arrow/12.0.0_1/include"
   arrow_libs <- "-L/opt/homebrew/Cellar/apache-arrow/12.0.0_1/lib -larrow"
   Sys.setenv("PKG_CXXFLAGS" = arrow_include)
   Sys.setenv("PKG_LIBS" = arrow_libs)
   
   cpp11::cpp_source(code = '
   #include <arrow/table.h>
   #include <arrow/c/bridge.h>
   #include <cpp11.hpp>
   
   using namespace arrow;
   
   // Version that returns a Result<> so we can use Arrow C++-style error handling
   // macros
   Result<int> count_rows_internal(SEXP array_stream_xptr) {
     auto array_stream = reinterpret_cast<struct ArrowArrayStream*>(
       R_ExternalPtrAddr(array_stream_xptr));
     
     ARROW_ASSIGN_OR_RAISE(auto reader, ImportRecordBatchReader(array_stream))
     
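     // ToTable() reads every batch from the stream into a single Table
     ARROW_ASSIGN_OR_RAISE(auto table, reader->ToTable());
     
     // num_rows() is an int64_t; narrow it explicitly to match the int return type
     return static_cast<int>(table->num_rows());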
   }
   
   // Version that uses cpp11 error handling
   [[cpp11::register]]
   int count_rows(SEXP array_stream_xptr) {
     Result<int> num_rows = count_rows_internal(array_stream_xptr);
     if (num_rows.ok()) {
       return *num_rows;
     } else {
       cpp11::stop("Arrow C++ error: %s", num_rows.status().ToString().c_str());
     }
   }
   
   ', cxx_std = "CXX17")
   
   
   library(arrow, warn.conflicts = FALSE)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   library(nanoarrow)
   
   tab <- arrow_table(x = 1:10)
   (array_stream <- as_nanoarrow_array_stream(tab))
   #> <nanoarrow_array_stream struct<x: int32>>
   #>  $ get_schema:function ()  
   #>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
   #>  $ release   :function ()
   count_rows(array_stream)
   #> [1] 10
   ```
   
   <sup>Created on 2023-06-26 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
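   
   For readers who want to see what a consumer such as `count_rows()` relies on underneath, here is a minimal sketch of the C stream protocol on its own, without linking to Arrow C++. The struct layouts are the ones published in the Arrow C data and C stream interface specifications (normally taken from `arrow/c/abi.h`). `count_rows_stream()`, `make_toy_stream()` and the toy producer are hypothetical names made up for this illustration, and the toy producer emits null-type batches so that it needs no buffers. It is not meant to show how `ImportRecordBatchReader()` is implemented, only the protocol it consumes.
   
   ```cpp
   // Sketch only: struct layouts follow the Arrow C data/stream interface
   // specification; count_rows_stream and the toy producer are hypothetical.
   #include <cassert>
   #include <cerrno>
   #include <cstdint>
   
   struct ArrowSchema;  // opaque here: only ever passed by pointer
   
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };
   
   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };
   
   // Consumer side: pull batches until the producer signals end of stream
   // (a batch whose release callback is null), then release the stream.
   // Returns the total row count, or -1 if get_next reports an error.
   int64_t count_rows_stream(struct ArrowArrayStream* stream) {
     assert(stream != nullptr && stream->release != nullptr);
     int64_t total = 0;
     while (true) {
       struct ArrowArray batch;
       if (stream->get_next(stream, &batch) != 0) {
         total = -1;
         break;
       }
       if (batch.release == nullptr) break;  // end of stream
       total += batch.length;
       batch.release(&batch);  // the consumer owns each batch it receives
     }
     stream->release(stream);
     return total;
   }
   
   // Toy producer (hypothetical): emits a fixed number of null-type batches,
   // which carry a length but no data buffers.
   struct ToyState {
     int batches_left;
     int64_t batch_length;
   };
   
   static void toy_array_release(struct ArrowArray* array) { array->release = nullptr; }
   
   static int toy_get_schema(struct ArrowArrayStream*, struct ArrowSchema*) {
     return ENOTSUP;  // unused in this sketch; a real producer must implement it
   }
   
   static int toy_get_next(struct ArrowArrayStream* stream, struct ArrowArray* out) {
     auto* state = static_cast<ToyState*>(stream->private_data);
     if (state->batches_left == 0) {
       out->release = nullptr;  // marks the end of the stream
       return 0;
     }
     state->batches_left--;
     *out = ArrowArray{state->batch_length, state->batch_length, 0, 0, 0,
                       nullptr, nullptr, nullptr, &toy_array_release, nullptr};
     return 0;
   }
   
   static const char* toy_get_last_error(struct ArrowArrayStream*) { return nullptr; }
   
   static void toy_stream_release(struct ArrowArrayStream* stream) {
     delete static_cast<ToyState*>(stream->private_data);
     stream->release = nullptr;
   }
   
   void make_toy_stream(int n_batches, int64_t batch_length, struct ArrowArrayStream* out) {
     out->get_schema = &toy_get_schema;
     out->get_next = &toy_get_next;
     out->get_last_error = &toy_get_last_error;
     out->release = &toy_stream_release;
     out->private_data = new ToyState{n_batches, batch_length};
   }
   ```
   
   Because both sides only agree on these plain C structs, the producer and the consumer can be built against different Arrow versions or with different compilers, which is the stability that passing a `std::shared_ptr<arrow::Table>` across packages does not give you.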




[GitHub] [arrow] paleolimbot commented on issue #36274: Expose a std::shared_ptr to R SEXP

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #36274:
URL: https://github.com/apache/arrow/issues/36274#issuecomment-1719510628

   > Does the time complexity of that scale with the number of chunks or the number of entries or neither?
   
   I am actually not sure of the time complexity of `as_record_batch_reader()` and what it depends on. My guess would be that it is very, very low but might be observable if your table has many (thousands of) columns. It almost certainly does not depend on the number of rows but might depend on the number of chunks. You will have to benchmark and see for the type of data you're planning to pass.
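   
   One reason the row count is unlikely to matter: the C data interface describes an array as a small struct of pointers to buffers that already exist. The sketch below illustrates that idea in isolation, assuming a plain `std::vector<int32_t>` as the owner of the data. The `ArrowArray` layout follows the published specification; `export_int32()` and `sum_int32()` are hypothetical helpers written for this example, and nothing here claims to be what `as_record_batch_reader()` does internally.
   
   ```cpp
   // Sketch only: the struct layout follows the Arrow C data interface
   // specification; export_int32 and sum_int32 are hypothetical helpers.
   #include <cassert>
   #include <cstdint>
   #include <cstdlib>
   #include <vector>
   
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };
   
   // Release callback: frees only the two-element pointer array allocated by
   // the producer, then marks the struct as released. The data is untouched.
   static void release_int32(struct ArrowArray* array) {
     std::free(array->buffers);
     array->release = nullptr;
   }
   
   // Producer: describes `values` without copying them. In this sketch the
   // caller must keep `values` alive until the consumer has called release().
   void export_int32(const std::vector<int32_t>& values, struct ArrowArray* out) {
     const void** buffers =
         static_cast<const void**>(std::malloc(2 * sizeof(void*)));
     assert(buffers != nullptr);
     buffers[0] = nullptr;        // validity bitmap: absent, there are no nulls
     buffers[1] = values.data();  // data buffer: a pointer, not a copy
     out->length = static_cast<int64_t>(values.size());
     out->null_count = 0;
     out->offset = 0;
     out->n_buffers = 2;
     out->n_children = 0;
     out->buffers = buffers;
     out->children = nullptr;
     out->dictionary = nullptr;
     out->release = &release_int32;
     out->private_data = nullptr;
   }
   
   // Consumer: reads the values through the exported pointers.
   int64_t sum_int32(const struct ArrowArray* array) {
     const int32_t* data = static_cast<const int32_t*>(array->buffers[1]);
     int64_t total = 0;
     for (int64_t i = 0; i < array->length; ++i) {
       total += data[array->offset + i];
     }
     return total;
   }
   ```
   
   After `export_int32()`, `array.buffers[1]` holds the same address as `values.data()`, so the export step does a fixed amount of work regardless of how many elements the vector holds. A table adds one such struct per column and per chunk, which is consistent with the guess that columns and chunks matter more than rows.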
   
   > My understanding is that it is possible to do a `std::shared_ptr<arrow::Table>` to `pyarrow::Table` cast, which means that there isn't any table copying going on.
   
   If you control the builds of both the Arrow R package and whatever C++ you're writing (e.g., via setting `ARROW_HOME` and building your own arrow R package or distributing an R package via conda-forge), you can do this in R too. It is my understanding that the only way to do this safely in Python would be via an `arrow-cpp` conda dependency (i.e., distribution via `pip` would be unsafe). Similarly, if you did this in R, distribution via the usual packaging process would not be safe because you do not control the build of the arrow R package.




[GitHub] [arrow] MysteriousPraetorian commented on issue #36274: Expose a std::shared_ptr to R SEXP

Posted by "MysteriousPraetorian (via GitHub)" <gi...@apache.org>.
MysteriousPraetorian commented on issue #36274:
URL: https://github.com/apache/arrow/issues/36274#issuecomment-1717669010

   @paleolimbot You mentioned that the `std::shared_ptr<arrow::Table>` is built with very specific compiler flags, so for stability reasons, it is unlikely to be exposed. From a pure speed standpoint, does that mean that R will always be at a disadvantage relative to Python? My understanding is that it is possible to do a `std::shared_ptr<arrow::Table>` to `pyarrow::Table` cast, which means that there isn't any table copying going on.
   
   Perhaps I am misunderstanding the `as_record_batch_reader` call. Does the time complexity of that scale with the number of chunks or the number of entries or neither?
   
   Cheers,
   MysteriousPraetorian

