Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/04/28 16:04:47 UTC

[GitHub] [arrow-nanoarrow] wjones127 opened a new issue, #187: Show examples of building Arrow C++ structures

wjones127 opened a new issue, #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187

   For users coming from Arrow C++ / PyArrow, it might not be obvious what nanoarrow structures to create so they can export. Depending on the context, an `ArrowSchema` can represent a data type, a field, or a schema. Similarly, an `ArrowArray` can represent an array or record batch (tabular). To start, we should show the correspondence between Nanoarrow structs and Arrow C++ types.
   
   
   | Nanoarrow | Arrow C++ |
   |----|----|
   | `ArrowArray` | `arrow::Array` |
   | `ArrowArray` where type is struct | `arrow::RecordBatch` |
   | `std::vector<ArrowArray>` | `arrow::ChunkedArray` |
   | `std::vector<ArrowArray>` where type is struct | `arrow::Table` |
   | `ArrowSchema` | `arrow::DataType` |
   | `ArrowSchema` | `arrow::Field` |
   | `ArrowSchema` | `arrow::Schema` |
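
   To make the last three rows concrete, here is a minimal sketch that reads one `ArrowSchema` tree in all three roles. The struct layout is copied from the Arrow C data interface specification so the block compiles without nanoarrow; `ExampleSchema`, `InitExampleSchema()`, `AsDataType()`, `AsField()`, and `AsSchema()` are hypothetical names made up for illustration, and the storage is static rather than heap-allocated as a real producer's would be.

   ```c++
   #include <cassert>
   #include <cstdint>
   #include <string>

   // Layout copied from the Arrow C data interface specification.
   struct ArrowSchema {
     const char* format;
     const char* name;
     const char* metadata;
     int64_t flags;
     int64_t n_children;
     struct ArrowSchema** children;
     struct ArrowSchema* dictionary;
     void (*release)(struct ArrowSchema*);
     void* private_data;
   };

   constexpr int64_t kFlagNullable = 2;  // ARROW_FLAG_NULLABLE in the spec

   // Nothing to free here (storage lives in ExampleSchema), so releasing
   // only marks the struct and its children as released.
   static void ReleaseStatic(ArrowSchema* schema) {
     for (int64_t i = 0; i < schema->n_children; i++) {
       ArrowSchema* child = schema->children[i];
       if (child->release != nullptr) child->release(child);
     }
     schema->release = nullptr;
   }

   // Hypothetical holder: columns id (int32, not null) and name (utf8).
   struct ExampleSchema {
     ArrowSchema id, name, root;
     ArrowSchema* children[2];
   };

   void InitExampleSchema(ExampleSchema* out) {
     out->id = ArrowSchema{"i", "id", nullptr, 0, 0,
                           nullptr, nullptr, &ReleaseStatic, nullptr};
     out->name = ArrowSchema{"u", "name", nullptr, kFlagNullable, 0,
                             nullptr, nullptr, &ReleaseStatic, nullptr};
     out->children[0] = &out->id;
     out->children[1] = &out->name;
     out->root = ArrowSchema{"+s", "", nullptr, 0, 2,
                             out->children, nullptr, &ReleaseStatic, nullptr};
   }

   // Read as arrow::DataType: only the format string is consulted here.
   std::string AsDataType(const ArrowSchema& s) { return s.format; }

   // Read as arrow::Field: name, type, and nullability.
   std::string AsField(const ArrowSchema& s) {
     std::string out = std::string(s.name) + ": " + s.format;
     if (!(s.flags & kFlagNullable)) out += " not null";
     return out;
   }

   // Read as arrow::Schema: a struct type whose children are the columns.
   std::string AsSchema(const ArrowSchema& s) {
     assert(std::string(s.format) == "+s");
     std::string out;
     for (int64_t i = 0; i < s.n_children; i++) {
       if (i > 0) out += ", ";
       out += AsField(*s.children[i]);
     }
     return out;
   }
   ```

   Nothing in the struct itself says which role applies; the caller decides by which fields it reads.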
   
   Then we may also want recipes for:
   
    * How to build a struct array in nanoarrow, and export as an `arrow::RecordBatch`
    * How to build a record batch reader, and export as an `arrow::Table`.
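
   As a starting point for the first recipe, this is a rough sketch of what a struct-typed `ArrowArray` looks like at the C data interface level: the parent carries the row count and a single (validity) buffer, and each child is one column. It deliberately avoids nanoarrow's builder functions so that it stands alone; `ExampleBatch`, `InitExampleBatch()`, and `Int32At()` are hypothetical names, and the storage is static for brevity. With Arrow C++ available, a struct array like this, paired with a matching `ArrowSchema`, is the kind of input `arrow::ImportRecordBatch()` from `arrow/c/bridge.h` expects.

   ```c++
   #include <cassert>
   #include <cstdint>

   // Layout copied from the Arrow C data interface specification.
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };

   // Storage is static, so releasing only marks parent and children released.
   static void ReleaseStatic(ArrowArray* array) {
     for (int64_t i = 0; i < array->n_children; i++) {
       ArrowArray* child = array->children[i];
       if (child->release != nullptr) child->release(child);
     }
     array->release = nullptr;
   }

   // Hypothetical holder: a struct array with one int32 column and three rows.
   struct ExampleBatch {
     int32_t values[3];
     const void* column_buffers[2];  // int32 layout: validity bitmap, data
     const void* struct_buffers[1];  // struct layout: validity bitmap only
     ArrowArray column;
     ArrowArray* children[1];
     ArrowArray root;
   };

   void InitExampleBatch(ExampleBatch* out) {
     out->values[0] = 10;
     out->values[1] = 20;
     out->values[2] = 30;
     out->column_buffers[0] = nullptr;  // null_count is 0, bitmap may be omitted
     out->column_buffers[1] = out->values;
     out->struct_buffers[0] = nullptr;
     out->column = ArrowArray{3, 0, 0, 2, 0, out->column_buffers,
                              nullptr, nullptr, &ReleaseStatic, nullptr};
     out->children[0] = &out->column;
     out->root = ArrowArray{3, 0, 0, 1, 1, out->struct_buffers,
                            out->children, nullptr, &ReleaseStatic, nullptr};
   }

   // Reads one int32 cell: the parent supplies the row range, the child
   // supplies the column values. Both offsets apply.
   int32_t Int32At(const ArrowArray& batch, int64_t column, int64_t row) {
     assert(column < batch.n_children && row < batch.length);
     const ArrowArray* child = batch.children[column];
     const int32_t* data = static_cast<const int32_t*>(child->buffers[1]);
     return data[child->offset + batch.offset + row];
   }
   ```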
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1822828319

   Cool! Is the example somewhere I can find it!? I would love to add it to `examples/linesplitter` or in exploded form to `r/vignettes/articles/extending.Rmd` (or blog post, or whatever!)
   
   > Tried setting the schema external pointer as a tag, but no luck so far.
   
   Hmm...I also usually do this in R (`nanoarrow::nanoarrow_array_set_schema()`). It seems pretty straightforward but I definitely may have missed something! https://github.com/apache/arrow-nanoarrow/blob/main/r/src/array.h#L80-L88




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1821131310

   I was wondering if we could bring this back to the fore?  I am currently losing my mind, as a simple 'call from R' wrapper for the (nice) `linesplitter` example works just fine when I start in R and pass an external pointer down:
   
   ```c++
   // Plain Interface
   // [[Rcpp::export]]
   bool linesplit_from_R_plain(const std::string str, SEXP sxparr) {
       // We get an R-created 'nanoarrow_array', an S3 class around an external pointer
       if (!Rf_inherits(sxparr, "nanoarrow_array"))
           Rcpp::stop("Expected class 'nanoarrow_array' not found");
   
       // It is a straight up external pointer so we can use R_ExternalPtrAddr()
       struct ArrowArray* arr = (struct ArrowArray*)R_ExternalPtrAddr(sxparr);
   
       auto res = linesplitter_read(str, arr);
       return true;
   }
   ```
   
   But when I want to start from C++, I get lost somewhere building the `nanoarrow`-compliant external pointer up by itself.  It all works fine when I cheat (as the `nanoarrow` package also does in places) and call into R:
   
   ```c++
   // res <- linesplit_from_cpp("the\nquick\nbrown\nfox");
   // print(res);
   // print(arrow::Array$create(res))
   //
   // [[Rcpp::export]]
   Rcpp::XPtr<ArrowArray> linesplit_from_cpp(const std::string str) {
       Rcpp::Environment ns = Rcpp::Environment::namespace_env("nanoarrow");
       Rcpp::Function f1 = Rcpp::Function("nanoarrow_array_init", ns);
       Rcpp::Function f2 = Rcpp::Function("na_string", ns);
       auto sxparr = f1(f2());
   
       // It is an external pointer we can access, here with checking
       auto arr = xptr_get_ptr<ArrowArray>(sxparr, "nanoarrow_array");
       auto res = linesplitter_read(str, arr);
   
       auto s = Rcpp::XPtr<ArrowArray>{sxparr};
       // setting a tag somehow upsets the Arrow nature of things
       //xptr_set_tag(s, Rcpp::wrap(XPtrTagType<ArrowArray>));
       return s;
   }
   ```
   
   (Apologies for the small bits of `Rcpp` but that's what I am most familiar with. The helpers used above are minimal wrappers around the C API of R, such as the following.)
   
   ```c++
   template <typename T> Rcpp::XPtr<T> make_xptr(SEXP p) {
       return Rcpp::XPtr<T>(p); 	// the default Rcpp::XPtr ctor with deleter on and tag and prot nil
   }
   ```
   
   So far so good but I still have two problems.  I can't seem to build an external pointer 'up from the C/C++ bases' to make `nanoarrow` happy.  It comes out ok, but invoking any of the R-level helpers goes astray.  A simple `print(res)` or `print(str(res))` of the return object gets an error of a failing allocation of a bazillion bytes.  The other (smaller) problem is that I got some use out of setting (and checking) external pointer tags, but somehow doing that here with the one gotten by calling into R also dirties some other bit.
   
   By now there are nice bits and pieces of `nanoarrow` use in the `adbc*` packages (and the repo), in `duckdb-r`, and possibly also in the geospatial apps (haven't looked).  The (vendorable) `nanoarrow.*` files are fine for _just_ C/C++ work.  The R interface is fine for work at the REPL prompt.  But there is much less on 'how to work with `nanoarrow` for R extensions' (and ditto for Python where support is slowly growing in the package).  So would there be some appetite to extend, say, what is in `nanoarrow.hpp` in light of possible 'interface helpers'?  I'd be happy to help along, of course.




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1822864318

   > As an aside you have a very very large part of nanoarrow hidden behind static which also leads the duckdb use to copy functions out
   
   I didn't know duckdb was using it! All `static` functions are internal and could change at any time (although I'm happy to expose functionality if there's something missing that can't be accessed by the functions in nanoarrow.h).




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1822851926

   The example I show above is in the manual page (and in the R code via the usual means).  There are a few things I should clean up still -- I like to have types other than `SEXP` in the signature, which requires the usual `inst/include/PACKAGENAME_types.h`, but I realized I probably don't want all of `nanoarrow.h` there but maybe just the standard C API of Arrow part.
   
   (As an aside you have a very very large part of `nanoarrow` hidden behind `static` which also leads the `duckdb` use to copy functions out.  That ... can't really be ideal so maybe we can talk about 'liberating' some of those functions.)
   
   Otherwise happy to stick the example into the README / add a quick `tinytest` test predicate or two.




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1822929054

   Yeah it's in https://github.com/duckdb/duckdb-r/tree/main/src/duckdb/src/common/adbc




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #187: Show examples of building Arrow C++ structures

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1527803912

   That's a great idea! That mapping is definitely not obvious unless you've spent a lot of time with the C data interface.




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1821361448

   > Yes: the tag is where (perhaps inadvisably) the external pointer to the ArrowSchema is stored, 
   
   Dang.  Re-reading this sentence I think I once knew that too from reading your code but forgot.  That explains that part.   I can try to also put one there when I build up 'from the other direction'.  




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1821357534

   Thanks for the quick reply!  I am definitely up for brainstorming a little more and 'slowly but surely' expanding this.  I'll also try to clean up my trivial little wrapper around `linesplitter` so that we have a base to start.  Currently fighting other fires but interesting to see you say "Usually I just allocate from R".  I had reasonably good luck allocating directly (as per earlier exchanges and tickets) and have gotten that mostly past `valgrind` and other checkers. (It's a bit bizarre: all test scripts pass in isolation, i.e. running one at a time, but when I check the package someone somewhere is unhappy and something is not cleaned up.  As you say in a dry comment "almost impossible" to check without some framework so I am happy to lean on what you have here ... but I am stumbling a little at the gate.)




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1821193828

   >  setting a tag somehow upsets the Arrow nature of things
   
   Yes: the `tag` is where (perhaps inadvisably) the external pointer to the `ArrowSchema` is stored, and I'm not sure the class is checked everywhere, so perhaps the failing allocation of a bazillion bytes is because it's being misinterpreted somehow. The `protected` member is where you should set any SEXP dependency, with the caveat that if there is already one there you have to maintain a reference to it (e.g., `list(your_new_sexp_dep, old_sexp_dep)`). You can use `nanoarrow_pointer_export()` to wrap the array in another array that maintains the reference via the `release()` callback instead of via the `prot` member. All of that is undocumented, of course...I didn't expect this level of internal use quite yet but obviously it should be clear 🙂 .
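
   To illustrate the `protected` caveat, here is a small model in plain C++. It is only a stand-in: `Obj`, `ExternalPtr`, and `AddDependency()` are hypothetical and are not R's C API. In a real extension the same idea would go through `R_ExternalPtrProtected()` and `R_SetExternalPtrProtected()`, and the combined value would be an R list.

   ```c++
   #include <cassert>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>

   // Stand-in for an R object (SEXP). Purely illustrative.
   struct Obj {
     std::string label;
     std::vector<std::shared_ptr<Obj>> items;  // filled when acting as a list
   };
   using ObjPtr = std::shared_ptr<Obj>;

   // Stand-in for an R external pointer and its three slots.
   struct ExternalPtr {
     void* addr = nullptr;  // would hold the ArrowArray*
     ObjPtr tag;            // nanoarrow stores the schema external pointer here
     ObjPtr prot;           // dependencies kept alive alongside the pointer
   };

   // Adds a dependency without dropping one that is already protected,
   // mirroring list(your_new_sexp_dep, old_sexp_dep). The tag is left alone.
   void AddDependency(ExternalPtr* xptr, ObjPtr dep) {
     if (!xptr->prot) {
       xptr->prot = std::move(dep);
       return;
     }
     auto both = std::make_shared<Obj>();
     both->label = "list";
     both->items = {std::move(dep), xptr->prot};
     xptr->prot = std::move(both);
   }
   ```

   The point of the model is that the tag slot is never overwritten and an existing dependency stays reachable after a new one is added.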
   
   > So would there be some appetite to extend, say, what is in nanoarrow.hpp in light of possible 'interface helpers' ?
   
   For R-specific helpers, perhaps a header like the one you mentioned in `r/inst/include/nanoarrow/nanoarrow_r.h|hpp`? Even if all they do is `Rf_eval()` to call into R in the first iteration (we can make them faster later if the call into R is limiting). Usually I just allocate from R and pass the `array_xptr` into the C/C++ function (e.g., https://github.com/apache/arrow-adbc/blob/main/r/adbcdrivermanager/R/adbc.R#L180-L191 ). At the very least, a copy of the C Data/Stream structures would be helpful.
   
   For Python, we generate Cython definitions (`nanoarrow_c.pxd`), and I've wondered if it's worth putting that in `dist/`: Python extensions can just copy nanoarrow.h, nanoarrow.c, and nanoarrow_c.pxd, and it's reasonably easy to wrap from there (Cython is considerably easier as a glue language than anything we have in R). This is what I've done in https://github.com/geoarrow/geoarrow-c/tree/main/python . nanoarrow for Python will probably serve a similar purpose (help with the allocating). In Python there is also https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html , which is more or less what `as_nanoarrow_XXX()` is trying to do.
   
   For C++ (i.e., `nanoarrow.hpp`), there is almost certainly more that could be useful, although I am hesitant to increase the scope beyond not leaking memory. It might be worth drafting an internal set of helpers you developed and used somewhere and linking to it here? (Or maybe that's not what you had in mind)
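
   For readers who have not looked at that scope, "not leaking memory" roughly means a move-only owner that calls `release()` exactly once. `nanoarrow.hpp` ships its own wrappers for this (e.g. `nanoarrow::UniqueArray`); the sketch below is not that implementation, just a self-contained illustration of the idea. `OwnedArray`, `CountingRelease()`, and `g_release_count` are hypothetical names.

   ```c++
   #include <cassert>
   #include <cstdint>
   #include <cstring>
   #include <utility>

   // Layout copied from the Arrow C data interface specification.
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };

   // Move-only owner: whoever holds the struct last releases it.
   class OwnedArray {
    public:
     OwnedArray() = default;
     ~OwnedArray() { reset(); }
     OwnedArray(const OwnedArray&) = delete;
     OwnedArray& operator=(const OwnedArray&) = delete;
     OwnedArray(OwnedArray&& other) noexcept { move_from(&other); }
     OwnedArray& operator=(OwnedArray&& other) noexcept {
       if (this != &other) {
         reset();
         move_from(&other);
       }
       return *this;
     }

     ArrowArray* get() { return &array_; }

     // Runs the producer's callback if the array is still live. A conforming
     // callback sets release to nullptr, which makes a second call a no-op.
     void reset() {
       if (array_.release != nullptr) array_.release(&array_);
     }

    private:
     // Moving a C data interface struct is a bitwise copy followed by
     // marking the source as released.
     void move_from(OwnedArray* other) {
       std::memcpy(&array_, &other->array_, sizeof(ArrowArray));
       other->array_.release = nullptr;
     }

     ArrowArray array_{};  // zero-initialized, so release starts as nullptr
   };

   // Stand-in producer callback that counts how often release ran.
   static int g_release_count = 0;
   static void CountingRelease(ArrowArray* array) {
     g_release_count++;
     array->release = nullptr;
   }
   ```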
   
   > 'how to work with nanoarrow for R extensions' 
   
   This should definitely be a vignette/article! As you noted there is now ADBC and soon geoarrow (and whatever you are up to!) which are the first few test cases. Porting the linesplitter example would be a great place to start, as you noted (i.e., here's how you'd wrap the function in an R package...).




Re: [I] Show examples of building Arrow C++ structures [arrow-nanoarrow]

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #187:
URL: https://github.com/apache/arrow-nanoarrow/issues/187#issuecomment-1821847487

   Tried setting the schema external pointer as a tag, but no luck so far.
   
   I put a (truly minimal) example up here. It builds and checks cleanly for me, and has one example with either `nanoarrow` (required) or `arrow` (suggested) export:
   
   ```r
   > example(linesplit, package="linesplitter")
   
   lnsplt> txt <- "the\nquick\nbrown\nfox"
   
   lnsplt> linesplit(txt)
   <nanoarrow_array string[4]>
    $ length    : int 4
    $ null_count: int 0
    $ offset    : int 0
    $ buffers   :List of 3
     ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
     ..$ :<nanoarrow_buffer data_offset<int32>[5][20 b]> `0 3 8 13 16`
     ..$ :<nanoarrow_buffer data<string>[16 b]> `thequickbrownfox`
    $ dictionary: NULL
    $ children  : list()
   
   lnsplt> linesplit(txt, TRUE)
   Array
   <string>
   [
     "the",
     "quick",
     "brown",
     "fox"
   ]
   > 
   
   ```
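
   For anyone reading the buffers in the printout above: a string array stores all characters back to back in the data buffer, and the offsets buffer has one more entry than there are elements. A minimal sketch of that packing, independent of nanoarrow and of the actual `linesplitter` code (`StringBuffers`, `SplitLines()`, and `ElementAt()` are hypothetical names):

   ```c++
   #include <cassert>
   #include <cstdint>
   #include <string>
   #include <vector>

   // The two non-validity buffers of an Arrow string array.
   struct StringBuffers {
     std::vector<int32_t> offsets;  // length + 1 entries, starting at 0
     std::string data;              // all element bytes, concatenated
   };

   // Splits on '\n' and packs each line into the offsets/data layout.
   StringBuffers SplitLines(const std::string& text) {
     StringBuffers out;
     out.offsets.push_back(0);
     std::string::size_type start = 0;
     while (start <= text.size()) {
       std::string::size_type end = text.find('\n', start);
       if (end == std::string::npos) end = text.size();
       out.data.append(text, start, end - start);
       out.offsets.push_back(static_cast<int32_t>(out.data.size()));
       start = end + 1;
     }
     return out;
   }

   // Element i spans data[offsets[i], offsets[i + 1]).
   std::string ElementAt(const StringBuffers& b, int64_t i) {
     return b.data.substr(b.offsets[i], b.offsets[i + 1] - b.offsets[i]);
   }
   ```

   Called on `"the\nquick\nbrown\nfox"`, this yields the offsets `0 3 8 13 16` and the data `thequickbrownfox` shown in the output. In this sketch a trailing newline produces a final empty element, which may or may not match what `linesplitter` does.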
   
   

