Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/21 22:36:31 UTC

[GitHub] [arrow] francisco-ixpantia opened a new issue, #14476: How to define a StructArray from R?

francisco-ixpantia opened a new issue, #14476:
URL: https://github.com/apache/arrow/issues/14476

   
   According to the Bigtable template documentation https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#parquetfiletocloudbigtable the required schema (in Avro format) is as follows:
   
   ```
   {
       "name" : "BigtableRow",
       "type" : "record",
       "namespace" : "com.google.cloud.teleport.bigtable",
       "fields" : [
         { "name" : "key", "type" : "bytes"},
         { "name" : "cells",
           "type" : {
             "type" : "array",
             "items": {
               "name": "BigtableCell",
               "type": "record",
               "fields": [
                 { "name" : "family", "type" : "string"},
                 { "name" : "qualifier", "type" : "bytes"},
                 { "name" : "timestamp", "type" : "long", "logicalType" : "timestamp-micros"},
                 { "name" : "value", "type" : "bytes"}
               ]
             }
           }
         }
      ]
   }
   ```
   
   Using R's Arrow library https://arrow.apache.org/docs/r/ this schema is set up as follows:
   
   ```R
   library(arrow)
   
   bigtablecell <- struct(
     family = string(),
     qualifier = binary(),
     timestamp = timestamp(unit = "us"),  # Avro logicalType is timestamp-micros
     value = binary()
   )
   bigtablerow <- schema(key = binary(), cells = list_of(bigtablecell))
   
   bigtablecell_schema <- schema(bigtablecell = bigtablecell)
   ```
   
   The problem is how to build, from R, a Parquet file that fits this schema. To represent a single row, the furthest I have come is the following:
   
   ```R
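   library(tibble)

   # NOTE: `family` is not defined in this snippet; it is assumed to hold the
   # column family name as a character vector.
   bigtablecells_test <- Array$create(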
     list(
       tibble(
         family = family,
         qualifier = Array$create("filter_id")$cast(binary())$as_vector(),
         timestamp = Array$create(1234567890L, type = int64())$cast(timestamp("ms"))$as_vector(),
         value = Array$create("0cd714fd-f6e8-4b76-aa16-1655b83e6148")$cast(binary())$as_vector()
       ),
       tibble(
         family = family,
         qualifier = Array$create("user_id")$cast(binary())$as_vector(),
         timestamp = Array$create(1234567891L, type = int64())$cast(timestamp("ms"))$as_vector(),
         value = Array$create("1655")$cast(binary())$as_vector()
       )
     )
   )$as_vector()
   
   key_test <- Array$create("1")$cast(binary())$as_vector()
   data <- tibble(key = key_test, cells = bigtablecells_test) # returns two rows!! 😢
   tab <- arrow_table(data, schema = bigtablerow)
   ```
   
   The call `Array$create(list(tibble(), tibble()))` is meant to build the `list_of(bigtablecell)` part of the schema, whose items form a `StructArray`. In R I cannot create that directly with `StructArray$create()`; instead I have to rely on `Array$create()` to detect the type automatically from the value I pass in.
   
   However, I have not been able to get the content I pass to `Array$create()` interpreted as the `StructArray` I need: when I convert it to a tibble it becomes two rows, instead of a single row with two columns. I have tried `Scalar` and `ChunkedArray` in different combinations without success.
   
   Consequently, I would like to request an example of how to define a `StructArray` from R, along with any information you consider relevant for building Parquet files from R. In addition to the official documentation, I have closely followed the articles on Danielle Navarro's blog, such as https://blog.djnavarro.net/posts/2022-05-25_arrays-and-tables-in-arrow/ but found no examples of building a `StructArray`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on issue #14476: How to define a StructArray from R?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #14476:
URL: https://github.com/apache/arrow/issues/14476#issuecomment-1292812367

   We really should have an easy `StructArray$create()`. We have a binding for this in dplyr, which allows a workaround until we add this properly!
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
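   make_struct <- function(...) {
     # Capture the arguments as named quosures, plus one symbol per name
     args <- rlang::enquos(..., .named = TRUE)
     syms <- rlang::set_names(rlang::syms(names(args)), names(args))
     # Build a Table with one column per argument, then pack those columns
     # into a single struct column via the tibble() binding inside transmute()
     table <- arrow_table(tibble::tibble(!!! args)) |>
       dplyr::transmute(tibble::tibble(!!! syms)) |>
       dplyr::compute()
     # The only column is a ChunkedArray of structs; convert it to an Array
     as_arrow_array(table[[1]])
   }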
   
   make_struct(a = 1:5, b = letters[1:5])
   #> StructArray
   #> <struct<a: int32, b: string>>
   #> -- is_valid: all not null
   #> -- child 0 type: int32
   #>   [
   #>     1,
   #>     2,
   #>     3,
   #>     4,
   #>     5
   #>   ]
   #> -- child 1 type: string
   #>   [
   #>     "a",
   #>     "b",
   #>     "c",
   #>     "d",
   #>     "e"
   #>   ]
   ```
   
   <sup>Created on 2022-10-26 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
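   
   For the shape described in the question (one row whose `cells` entry holds two structs), a minimal sketch follows. It assumes that `Array$create()` converts a data frame to a `StructArray` and a list of data frames to a list-of-struct array, which the question's own code also relies on. The values `"cf1"`, `"filter_id"` and `"user_id"` are placeholders, and the binary and timestamp fields are left out to keep it short.
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Placeholder values for illustration only; each data-frame row is one cell.
   cells <- data.frame(
     family = c("cf1", "cf1"),
     qualifier = c("filter_id", "user_id"),
     stringsAsFactors = FALSE
   )
   
   # A data frame becomes a StructArray with one element per data-frame row.
   cell_structs <- Array$create(cells)
   cell_structs$length()
   
   # Wrapping the whole data frame in a length-one list gives a list array with
   # a single element that holds both structs: one row, two cells.
   one_row <- Array$create(list(cells))
   one_row$length()
   
   # Two one-row data frames in a list give two list elements instead, which is
   # the "two rows" shape reported in the question.
   cell_a <- data.frame(family = "cf1", qualifier = "filter_id", stringsAsFactors = FALSE)
   cell_b <- data.frame(family = "cf1", qualifier = "user_id", stringsAsFactors = FALSE)
   two_rows <- Array$create(list(cell_a, cell_b))
   two_rows$length()
   ```
   
   The number of rows follows the length of the outer list, so the cells that belong to one key need to sit together in a single data frame.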




[GitHub] [arrow] wjones127 commented on issue #14476: How to define a StructArray from R?

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #14476:
URL: https://github.com/apache/arrow/issues/14476#issuecomment-1397634172

   `StructArray$create()` was added in #31660.
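   
   With a release that includes that change, the workaround above should no longer be needed. A hedged sketch, assuming `StructArray$create()` accepts a single data frame (the exact release and any other call forms are not confirmed here):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Assumption: StructArray$create() accepts one data frame and returns a
   # StructArray with one child array per column.
   df <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
   s <- StructArray$create(df)
   
   s$type                  # struct type with fields a and b
   s$GetFieldByName("b")   # child array for column b
   ```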




[GitHub] [arrow] thisisnic commented on issue #14476: How to define a StructArray from R?

Posted by GitBox <gi...@apache.org>.
thisisnic commented on issue #14476:
URL: https://github.com/apache/arrow/issues/14476#issuecomment-1293360235

   Looks like we have a ticket open on the project JIRA to implement this, but nobody's looking at it right now: https://issues.apache.org/jira/browse/ARROW-16266




[GitHub] [arrow] wjones127 closed issue #14476: How to define a StructArray from R?

Posted by GitBox <gi...@apache.org>.
wjones127 closed issue #14476: How to define a StructArray from R?
URL: https://github.com/apache/arrow/issues/14476

