Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/22 15:38:50 UTC

[GitHub] [arrow-nanoarrow] eddelbuettel opened a new issue, #67: [R] Add some C level examples?

eddelbuettel opened a new issue, #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67

   We had some very good experience with some of the (never released) predecessor packages that came before this one.  One key use case is to turn allocated buffers of the two well-understood `struct`s that form the official C API into Arrow objects. 
   
   Being able to do this with a 'narrower' package is really nice.  This package is coming along nicely, but I am getting lost in the changes between it and the predecessors.  
   
   Do you foresee adding some examples from 'deeper down' at the C level?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337421412

   Hi @paleolimbot -- the list in #71 looks great, and the recent PR that also covers streams is very useful too.   Sadly there is still too much that differs in `nanoarrow` relative to your various predecessor projects (which still work) for me to make a switch.  I took the above and extended it minimally.  I get what looks like a reasonable Arrow object, but trying to materialize it into a `tibble` blows up (whereas this works just fine, as it should, with the predecessors):
   
   ```r
   > rl <- rcppnanoarrow::createArray()
   > unclass(rl)
   $schema
   <pointer: 0x55e569b9d0f0>
   attr(,"class")
   [1] "arch_schema"
   
   $array_data
   <pointer: 0x55e5690ec1a0>
   attr(,"class")
   [1] "arch_array_data"
   
   > arrow::as_arrow_table(arch::from_arch_array(rl, arrow::RecordBatch))
   Table
   3 rows x 2 columns
   $intCol <int32>
   $dblCol <double>
   > tibble::as_tibble(arrow::as_arrow_table(arch::from_arch_array(rl, arrow::RecordBatch)))
   
    *** caught segfault ***
   address (nil), cause 'memory not mapped'
   
   Traceback:
    1: vec_slice(x, seq_len(n))
    2: vec_head(as.data.frame(x), n)
    3: df_head(x, n)
    4: tbl_format_setup.tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = focus)
    5: tbl_format_setup_dispatch(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = focus)
    6: tbl_format_setup(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = attr(x, "pillar_focus"))
    7: format_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
    8: format.tbl(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
    9: format(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
   10: writeLines(format(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines))
   11: print_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
   12: print.tbl(x)
   13: (function (x, ...) UseMethod("print"))(x)
   
   Possible actions:
   1: abort (with core dump, if enabled)
   2: normal R exit
   3: exit R without saving workspace
   4: exit R saving workspace
   Selection: 
   ```
   
   The simple helper function to create the object is below; it is a simple extension of the example stub you posted.  Can you spot what I am missing here? (And I tried two different S3 classes for dispatch.)  
   
   <details>
   
   Simple Rcpp-wrapped object creator below. It uses a list for convenience to transport the two external pointers to schema and array data, following prior practice in narrow/sparrow/carrow/... 
   
   ```c++
   //' @export
   // [[Rcpp::export]]
   Rcpp::List createArray() {
       const int ncol = 2;
       const int nrow = 3;
       auto schemaxp = Rcpp::XPtr<struct ArrowSchema>(new struct ArrowSchema);
       schemaxp.attr("class") = "arch_schema";
       auto schema = schemaxp.get();
       ArrowSchemaInit(schema, NANOARROW_TYPE_STRUCT);
       ArrowSchemaAllocateChildren(schema, ncol);
   
       auto arrayxp = Rcpp::XPtr<struct ArrowArray>(new struct ArrowArray);
       arrayxp.attr("class") = "arch_array_data";
       auto array = arrayxp.get();
       ArrowArrayInit(array, NANOARROW_TYPE_STRUCT);
       ArrowArrayAllocateChildren(array, ncol);
       array->length = nrow;
       array->null_count = -1;
   
       // ...fill in schema.children and array.children
       ArrowSchemaInit(schema->children[0], NANOARROW_TYPE_INT32);
       ArrowSchemaSetName(schema->children[0], "intCol");
       ArrowArrayInit(array->children[0], NANOARROW_TYPE_INT32);
       ArrowArrayAppendInt(array->children[0], 21);
       ArrowArrayAppendInt(array->children[0], 42);
       ArrowArrayAppendInt(array->children[0], 63);
   
       ArrowSchemaInit(schema->children[1], NANOARROW_TYPE_DOUBLE);
       ArrowSchemaSetName(schema->children[1], "dblCol");
       ArrowArrayInit(array->children[1], NANOARROW_TYPE_DOUBLE);
       ArrowArrayAppendDouble(array->children[1], 21.1);
       ArrowArrayAppendDouble(array->children[1], 42.2);
       ArrowArrayAppendDouble(array->children[1], 63.3);
   
       Rcpp::List as = Rcpp::List::create(Rcpp::Named("schema") = schemaxp,
                                          Rcpp::Named("array_data") = arrayxp);
       //as.attr("class") = "nanoarrow_array";
       as.attr("class") = "arch_array";
       return as;
   }
   
   ```
   
   </details>




[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337485050

   Doh! My bad. I did of course look at / build / run those minimal examples, and my fault for not carrying the start/finish helper over, will do so now.  (Of course, what I _really_ want (eventually) is to be able to inject full buffers of data per `memcpy` to use `nanoarrow` as a lighterweight replacement for Arrow in the creation steps of some data pipelines.)
   
   




[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337497650

   Thanks so much:
   
   ```r
   > rl <- rcppnanoarrow::createArray()
   > rl
   $schema
   <pointer: 0x55aa74b08ee0>
   attr(,"class")
   [1] "arch_schema"
   
   $array_data
   <pointer: 0x55aa76febc90>
   attr(,"class")
   [1] "arch_array_data"
   
   attr(,"class")
   [1] "arch_array"
   > tibble::as_tibble(arch::from_arch_array(rl, arrow::RecordBatch))
   # A tibble: 3 × 2
     intCol dblCol
      <int>  <dbl>
   1     21   21.1
   2     42   42.2
   3     63   63.3
   > 
   ```
   
   'Repaired' code below.  
   
   <details>
   
   ```c++
   //' @export
   // [[Rcpp::export]]
   Rcpp::List createArray() {
       const int ncol = 2;
       const int nrow = 3;
       auto schemaxp = Rcpp::XPtr<struct ArrowSchema>(new struct ArrowSchema);
       schemaxp.attr("class") = "arch_schema";
       auto schema = schemaxp.get();
       ArrowSchemaInit(schema, NANOARROW_TYPE_STRUCT);
       ArrowSchemaAllocateChildren(schema, ncol);
   
       auto arrayxp = Rcpp::XPtr<struct ArrowArray>(new struct ArrowArray);
       arrayxp.attr("class") = "arch_array_data";
       auto array = arrayxp.get();
       ArrowArrayInit(array, NANOARROW_TYPE_STRUCT);
       ArrowArrayAllocateChildren(array, ncol);
       array->length = nrow;
       array->null_count = -1;
   
       // ...fill in schema.children and array.children
       ArrowSchemaInit(schema->children[0], NANOARROW_TYPE_INT32);
       ArrowSchemaSetName(schema->children[0], "intCol");
       ArrowArrayInit(array->children[0], NANOARROW_TYPE_INT32);
       ArrowArrayStartAppending(array->children[0]);
       ArrowArrayAppendInt(array->children[0], 21);
       ArrowArrayAppendInt(array->children[0], 42);
       ArrowArrayAppendInt(array->children[0], 63);
       ArrowArrayFinishBuilding(array->children[0], nullptr);
   
       ArrowSchemaInit(schema->children[1], NANOARROW_TYPE_DOUBLE);
       ArrowSchemaSetName(schema->children[1], "dblCol");
       ArrowArrayInit(array->children[1], NANOARROW_TYPE_DOUBLE);
       ArrowArrayStartAppending(array->children[1]);
       ArrowArrayAppendDouble(array->children[1], 21.1);
       ArrowArrayAppendDouble(array->children[1], 42.2);
       ArrowArrayAppendDouble(array->children[1], 63.3);
       ArrowArrayFinishBuilding(array->children[1], nullptr);
   
       Rcpp::List as = Rcpp::List::create(Rcpp::Named("schema") = schemaxp,
                                          Rcpp::Named("array_data") = arrayxp);
       //as.attr("class") = "nanoarrow_array";
       as.attr("class") = "arch_array";
       return as;
   }
   ```
   </details>




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1323878359

   Yes, there are changes from 'narrow' that reflect my understanding of the C data/stream interfaces and the capabilities of the nanoarrow C library. The use case of creating ArrowArrays from buffers is hugely important and will be added before a CRAN release.
   
   Can you give an example of the code you would like to write at the C level? I can envision a use-case where the buffers were never R objects to begin with (e.g., allocated with malloc or std::vector) but that you would like to create an ArrowArray from them?
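For the malloc case raised above, a minimal sketch that uses only the two structs from the Arrow C data interface specification might look like the following. It does not call nanoarrow at all; the helper names `make_int32_schema()` and `wrap_int32_buffer()` and both release callbacks are hypothetical, and the sketch assumes the array takes ownership of the malloc'ed buffer instead of copying it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Struct definitions as given in the Arrow C data interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

#define ARROW_FLAG_NULLABLE 2

/* Hypothetical callback: format and name are not owned, so nothing is freed. */
static void release_int32_schema(struct ArrowSchema* schema) {
  schema->release = NULL;
}

/* Hypothetical callback: frees the data buffer and the buffer-pointer array. */
static void release_int32_array(struct ArrowArray* array) {
  free((void*)array->buffers[1]);
  free((void*)array->buffers);
  array->release = NULL;
}

/* Describe an int32 column; `name` must outlive the schema (e.g. a literal). */
void make_int32_schema(struct ArrowSchema* out, const char* name) {
  out->format = "i"; /* "i" is the format string for int32 */
  out->name = name;
  out->metadata = NULL;
  out->flags = ARROW_FLAG_NULLABLE;
  out->n_children = 0;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = &release_int32_schema;
  out->private_data = NULL;
}

/* Wrap a malloc'ed buffer without copying; on success `out` owns `values`. */
int wrap_int32_buffer(int32_t* values, int64_t length, struct ArrowArray* out) {
  const void** buffers = (const void**)malloc(2 * sizeof(void*));
  if (buffers == NULL) return ENOMEM;
  buffers[0] = NULL;   /* no validity bitmap because null_count is 0 */
  buffers[1] = values; /* the data buffer */
  out->length = length;
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = &release_int32_array;
  out->private_data = NULL;
  return 0;
}
```

A consumer is expected to call `array.release(&array)` exactly once when it is done, which is the point at which the original malloc'ed memory is freed.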




[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by "eddelbuettel (via GitHub)" <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1583027779

   "Moar" C++ examples would be good too, as well as "moar" examples for higher-level data structures, but last I checked there are already tickets open for that so all good.  Thanks!




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1324185749

   Cool! I will open a "roadmap" issue shortly...it should exist somewhere.
   
   I think what you are after might be a good fit for the C library ( see https://apache.github.io/arrow-nanoarrow/dev/c.html#creating-arrays ), which would let you construct a RecordBatch with perhaps less indirection. Your example might be something like:
   
   ```c
   struct ArrowSchema schema;
   ArrowSchemaInit(&schema, NANOARROW_TYPE_STRUCT);
   ArrowSchemaAllocateChildren(&schema, ncol);
   
   struct ArrowArray array;
   ArrowArrayInit(&array, NANOARROW_TYPE_STRUCT);
   ArrowArrayAllocateChildren(&array, ncol);
   array.length = rows;
   array.null_count = -1;
   for (size_t i=0; i<ncol; i++) {
     // ...fill in schema.children and array.children
   }
   ```
   
   That said, as long as the external pointers carry the class `"nanoarrow_schema"`, `"nanoarrow_array"` and `"nanoarrow_array_stream"` I think an "arch" package that provides that type of C API would happily augment anything that exists in nanoarrow. I don't envision providing access to C callables in the initial release, mostly because my development focus will be on more user-facing behaviour (like ALTREP conversions).




[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337518786

   Excellent. I did of course scavenger-hunt among the tests, as we're all familiar with the 'tests are docs' pattern, but missed that.  Will try to play along.




[GitHub] [arrow-nanoarrow] eddelbuettel commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
eddelbuettel commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1323909187

   Absolutely. I am on a call now, and now of what we do is hidden in a branch of a public repo but another I looked at this morning follows the same scheme, and in C(++) I do:
   
   ```c++
       for (size_t i=0; i<ncol; i++) {
           // this allocates, and properly wraps as external pointers controlling lifetime
           SEXP schemaxp = arch_c_allocate_schema();
           SEXP arrayxp = arch_c_allocate_array_data();
   
           // now buf is a shared_ptr to ColumnBuffer
           auto buf = sr_data->get()->at(names[i]);
   
           // this is pair of array and schema pointer          
           auto pp = tdbs::ArrowAdapter::to_arrow(buf);
   
           memcpy((void*) R_ExternalPtrAddr(schemaxp), pp.second.get(), sizeof(ArrowSchema));
           memcpy((void*) R_ExternalPtrAddr(arrayxp), pp.first.get(), sizeof(ArrowArray));
   
           schlst[i] = schemaxp;
           arrlst[i] = arrayxp;
       }
   
       struct ArrowArray* array_data_tmp = (struct ArrowArray*) R_ExternalPtrAddr(arrlst[0]);
       int rows = static_cast<int>(array_data_tmp->length);
       SEXP sxp = arch_c_schema_xptr_new(Rcpp::wrap("+s"), 	// format
                                         Rcpp::wrap(""),   	// name
                                         Rcpp::List(),       	// metadata
                                         Rcpp::wrap(2),      	// flags, 2: unord., nullable, no sorted map
                                         schlst, 	        	// children
                                         R_NilValue);        	// dictionary
       SEXP axp = arch_c_array_from_sexp(Rcpp::List::create(Rcpp::Named("")=R_NilValue), // buffers
                                         Rcpp::wrap(rows), 	// length
                                         Rcpp::wrap(-1), 	    // null count, -1 means not determined
                                         Rcpp::wrap(0),    	// offset (in elements, not bytes)
                                         arrlst,               // children
                                         R_NilValue);          // dictionary
       Rcpp::List as = Rcpp::List::create(Rcpp::Named("schema") = sxp,
                                          Rcpp::Named("array_data") = axp);
       as.attr("class") = "arch_array";
       return as;
   ```
   
   I called this `arch` to not step on your toes but it is essentially your `narrow`.  In R it is then 
   
   ```r
       dat |> arch::from_arch_array(arrow::RecordBatch) |> arrow::as_arrow_table() |> dplyr::collect() -> D
   ```
   
   All this works, in a basic manner, and I did a similar exercise for RecordBatches 'in pieces'.  I have not gotten to ALTREP and other extensions we all would love to have here too.  So `nanoarrow` is really exciting -- but I miss a roadmap (not in the sense of what will come, but in the sense of helping me to get going).




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337508185

   All good! I don't have an example for that pattern yet in examples/ but there's a test that does that: https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/array_test.cc#L182-L231
   
   If you want/need to avoid the `memcpy()`, the "deallocator" pattern may be helpful. I haven't used it yet in anything except dinky tests but the idea is to use it in the R package to avoid copying vectors (just like narrow did): https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/utils_test.cc#L104-L127
   
   > nanoarrow as a lighterweight replacement for Arrow in the creation steps of some data pipelines.
   
   That's the whole idea! 
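The idea behind the "deallocator" pattern mentioned above can be sketched in plain C without nanoarrow. In this sketch the buffer only borrows memory owned by something else and calls a user-supplied callback when it is done, so no `memcpy()` is needed. All names here (`BorrowedBuffer`, `borrowed_buffer_init()`, `borrowed_buffer_reset()`, `Owner`, `owner_unref()`) are hypothetical and are not the nanoarrow API; the linked test shows the real one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback invoked once when the borrowed memory is no longer needed. */
typedef void (*BufferFreeFn)(uint8_t* data, int64_t size_bytes,
                             void* private_data);

/* A buffer that points at memory it did not allocate. */
struct BorrowedBuffer {
  uint8_t* data;
  int64_t size_bytes;
  BufferFreeFn free_fn;
  void* private_data;
};

/* Point the buffer at existing memory; nothing is copied. */
void borrowed_buffer_init(struct BorrowedBuffer* buf, uint8_t* data,
                          int64_t size_bytes, BufferFreeFn free_fn,
                          void* private_data) {
  buf->data = data;
  buf->size_bytes = size_bytes;
  buf->free_fn = free_fn;
  buf->private_data = private_data;
}

/* Hand the memory back to its owner; safe to call more than once. */
void borrowed_buffer_reset(struct BorrowedBuffer* buf) {
  if (buf->free_fn != NULL && buf->data != NULL) {
    buf->free_fn(buf->data, buf->size_bytes, buf->private_data);
  }
  buf->data = NULL;
  buf->size_bytes = 0;
  buf->free_fn = NULL;
  buf->private_data = NULL;
}

/* Example owner: a reference-counted block of doubles. In an R package the
 * owner would more likely be a protected R vector (an assumption here). */
struct Owner {
  double values[3];
  int ref_count;
};

/* Deallocator for the example owner: drops one reference, frees nothing. */
void owner_unref(uint8_t* data, int64_t size_bytes, void* private_data) {
  (void)data;
  (void)size_bytes;
  struct Owner* owner = (struct Owner*)private_data;
  owner->ref_count--;
}
```

The design choice is that the buffer never needs to know how the memory was allocated; it only needs the callback and an opaque pointer to pass back to it.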




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #67: [R] Add some C level examples?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1337471166

   Is there anything I should add to the checklist in #71? I believe the "creating from buffers" and "manually creating schemas" are the two main things that are missing and I will add them before the release. Because of the nature of an "Apache Release", which has a very specific definition, I have to implement a bunch of stuff in the C library and expand its test coverage a bit before I can make an official "nanoarrow" CRAN release.
   
   In your example, I believe that you are missing `ArrowArrayStartAppending()` and `ArrowArrayFinishBuilding()`. Also note that all of these may return something other than `NANOARROW_OK` (i.e., zero) which would indicate a failure (normally a failed allocation). You can use the `NANOARROW_RETURN_NOT_OK()` macro to keep that from becoming unreadable. There's an example with this in the examples/ directory: https://github.com/apache/arrow-nanoarrow/blob/main/examples/vendored-minimal/src/library.c#L31-L47
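The error-handling advice above can be illustrated without nanoarrow itself. In the sketch below, `RETURN_NOT_OK` is a hypothetical stand-in that follows the same early-return idea as `NANOARROW_RETURN_NOT_OK()`, and `IntBuilder`, `builder_init()`, `builder_append()` and `build_three()` are made-up names for illustration only. The builder has a fixed capacity so that a failure (standing in for a failed allocation) is easy to trigger.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define STATUS_OK 0

/* Evaluate EXPR once; if it is not STATUS_OK, return its value from the
 * enclosing function. Hypothetical stand-in, not the nanoarrow macro. */
#define RETURN_NOT_OK(EXPR)                \
  do {                                     \
    const int code_ = (EXPR);              \
    if (code_ != STATUS_OK) return code_;  \
  } while (0)

/* Toy fixed-capacity builder used only to demonstrate the pattern. */
struct IntBuilder {
  int32_t* data;
  int64_t length;
  int64_t capacity;
};

/* Allocate room for `capacity` values; returns ENOMEM if malloc fails. */
int builder_init(struct IntBuilder* b, int64_t capacity) {
  b->data = (int32_t*)malloc((size_t)capacity * sizeof(int32_t));
  b->length = 0;
  b->capacity = (b->data == NULL) ? 0 : capacity;
  return (b->data == NULL) ? ENOMEM : STATUS_OK;
}

/* Append one value; returns ENOMEM when the builder is full. */
int builder_append(struct IntBuilder* b, int32_t value) {
  if (b->length >= b->capacity) return ENOMEM;
  b->data[b->length++] = value;
  return STATUS_OK;
}

/* Each step is checked, but the happy path stays readable. */
int build_three(struct IntBuilder* b) {
  RETURN_NOT_OK(builder_append(b, 21));
  RETURN_NOT_OK(builder_append(b, 42));
  RETURN_NOT_OK(builder_append(b, 63));
  return STATUS_OK;
}
```

With a capacity of three, `build_three()` returns zero; with a capacity of two, the third append fails and its error code is returned to the caller without any explicit `if` in `build_three()`.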




[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #67: [R] Add some C level examples?

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #67:
URL: https://github.com/apache/arrow-nanoarrow/issues/67#issuecomment-1583012083

   I'm considering this "closed" in connection with the recent documentation updates for the purposes of bookkeeping before the forthcoming release; however, feel free to open up another issue with any more proposed improvements!




[GitHub] [arrow-nanoarrow] paleolimbot closed issue #67: [R] Add some C level examples?

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot closed issue #67: [R] Add some C level examples?
URL: https://github.com/apache/arrow-nanoarrow/issues/67

