You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/06 19:35:49 UTC

[GitHub] [arrow] nealrichardson opened a new pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

nealrichardson opened a new pull request #7648:
URL: https://github.com/apache/arrow/pull/7648


   In terms of number of lines of code, this wasn't bad, though I don't know how efficient these methods are. Maybe there's a better way
   
   The one thing that would be lost is any metadata attached to the Table schema because the Table is reconstructed from its ChunkedArrays without schema. I wonder if we could export the Schema on its own--the existing `_export_to_c`/`_import_from_c` methods take both an array pointer and a schema pointer.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #7648:
URL: https://github.com/apache/arrow/pull/7648


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654729844


   The main source of potential inefficiency here is that the Schema is exported/imported once for each chunk. We may or may not case immediately about this.
   
   Also, note that you can transfer a Table as a sequence of RecordBatches, rather than a sequence of ChunkedArrays. If you have many many columns, it would probably be more efficient.
   
   (in C++, you can use `TableBatchReader` to export a Table as RecordBatches, and `Table::FromRecordBatches` to recreate a Table from the RecordBatches)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-655026342


   Ok then I'll merge this and we can come back and improve it later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654999245


   @wesm do you have an opinion on this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-655004137


   As long as the implementation details / semantics aren't exposed (they don't seem to be), this seems sufficient to me to have the feature established, and we an always return later and make it more efficient (or replace it with the iterator mechanism that will hopefully have some discussion on the ML)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654929934


   > Also, note that you can transfer a Table as a sequence of RecordBatches, rather than a sequence of ChunkedArrays. 
   
   But Tables *are* a sequence of ChunkedArrays, right? Those ChunkedArrays may not be chunked the same way, and dictionaries within a ChunkedArray may not be the same, etc. Can TableBatchReader handle those cases robustly and with zero-copy? 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654730028


   (otherwise, the code here looks ok, but I'm not a R expert at all :-))


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654431173


   https://issues.apache.org/jira/browse/ARROW-8301


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654954793


   However, it may use the offset member of arrays, which might not work with the C Data Interface...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#discussion_r450932956



##########
File path: r/R/python.R
##########
@@ -73,6 +73,44 @@ r_to_py.RecordBatch <- function(x, convert = FALSE) {
   out
 }
 
+r_to_py.ChunkedArray <- function(x, convert = FALSE) {
+  # Import with convert = FALSE so that `_import_from_c` returns a Python object
+  pa <- reticulate::import("pyarrow", convert = FALSE)
+  out <- pa$chunked_array(x$chunks)
+  # But set the convert attribute on the return object to the requested value
+  assign("convert", convert, out)
+  out
+}
+
+py_to_r.pyarrow.lib.ChunkedArray <- function(x, ...) {
+  ChunkedArray$create(!!!maybe_py_to_r(x$chunks))
+}
+
+r_to_py.Table <- function(x, convert = FALSE) {
+  # Import with convert = FALSE so that `_import_from_c` returns a Python object
+  pa <- reticulate::import("pyarrow", convert = FALSE)
+  out <- pa$Table$from_arrays(x$columns, names = names(x))
+  # But set the convert attribute on the return object to the requested value
+  assign("convert", convert, out)
+  out
+}
+
+py_to_r.pyarrow.lib.Table <- function(x, ...) {
+  colnames <- maybe_py_to_r(x$column_names)
+  r_cols <- maybe_py_to_r(x$columns)
+  names(r_cols) <- colnames
+  Table$create(!!!r_cols)

Review comment:
       It's like doing `**kwargs`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#discussion_r450734963



##########
File path: r/R/python.R
##########
@@ -73,6 +73,44 @@ r_to_py.RecordBatch <- function(x, convert = FALSE) {
   out
 }
 
+r_to_py.ChunkedArray <- function(x, convert = FALSE) {
+  # Import with convert = FALSE so that `_import_from_c` returns a Python object
+  pa <- reticulate::import("pyarrow", convert = FALSE)
+  out <- pa$chunked_array(x$chunks)
+  # But set the convert attribute on the return object to the requested value
+  assign("convert", convert, out)
+  out
+}
+
+py_to_r.pyarrow.lib.ChunkedArray <- function(x, ...) {
+  ChunkedArray$create(!!!maybe_py_to_r(x$chunks))
+}
+
+r_to_py.Table <- function(x, convert = FALSE) {
+  # Import with convert = FALSE so that `_import_from_c` returns a Python object
+  pa <- reticulate::import("pyarrow", convert = FALSE)
+  out <- pa$Table$from_arrays(x$columns, names = names(x))
+  # But set the convert attribute on the return object to the requested value
+  assign("convert", convert, out)
+  out
+}
+
+py_to_r.pyarrow.lib.Table <- function(x, ...) {
+  colnames <- maybe_py_to_r(x$column_names)
+  r_cols <- maybe_py_to_r(x$columns)
+  names(r_cols) <- colnames
+  Table$create(!!!r_cols)

Review comment:
       Out of curiosity, what does the `!!!` mean?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7648: ARROW-8301: [R] Handle ChunkedArray and Table in C data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7648:
URL: https://github.com/apache/arrow/pull/7648#issuecomment-654954241


   It should :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org