Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/13 18:41:53 UTC

[GitHub] [arrow] nealrichardson commented on a change in pull request #9182: ARROW-10386: [R] List column class attributes not preserved in roundtrip

nealrichardson commented on a change in pull request #9182:
URL: https://github.com/apache/arrow/pull/9182#discussion_r556739981



##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the

Review comment:
       "When converting a `data.frame` to an Arrow `Table` or `RecordBatch`, "

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:

Review comment:
       This section should either be called something like "R Metadata", or it should start by discussing the key-value metadata more generally.

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).

Review comment:
       I believe only Pandas stores extra metadata, not pyarrow itself.

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for

Review comment:
       According to the code, this is only true for list columns (which makes sense because regular vectors can't have attributes on elements)
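
To illustrate the point about element-level attributes, here is a minimal base R sketch. It does not use the arrow package, and the data frame `df` and its column names are hypothetical. It only shows why element attributes can exist in a list column but not in an atomic one.

```r
# Base R only; `df`, `col_a`, and `col_b` are hypothetical names for illustration.
df <- data.frame(col_a = c(1, 2))
df$col_b <- list(
  structure(1:2, class = c("my_class", "integer")),
  structure(3:4, note = "second element")
)

# Elements of an atomic column are bare values, so extracting one
# yields an object with no attributes.
atomic_attrs <- attributes(df$col_a[[1]])

# Elements of a list column are full R objects and keep their own attributes.
list_attrs_1 <- attributes(df$col_b[[1]])
list_attrs_2 <- attributes(df$col_b[[2]])
```

Here `atomic_attrs` is `NULL`, while `list_attrs_1` and `list_attrs_2` carry the `class` and `note` attributes set on each element.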

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.

Review comment:
       This shouldn't say table; we're in the `Schema` docs.
   
   ```suggestion
   #'   This metadata is stored in `$metadata$r`.
   ```

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.
+#'
+#'   This metadata is saved by serializing R's attribute list structure to a

Review comment:
       "Since Schema metadata keys and values must be strings, ..."

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.

Review comment:
       ```suggestion
   #'   (column names and types), which is compatible with other Arrow clients.
   ```

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.
+#'
+#'   This metadata is saved by serializing R's attribute list structure to a
+#'   serialized string. Because of this, large amounts of metadata can quickly
+#'   increase the size of tables (and therefore the size of tables written to
+#'   parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
+#'   size, it is first compressed before saving. To disable this compression
+#'   (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+#'   include large amounts of metadata) you can set the option
+#'   `arrow.compress_metadata` to `FALSE`.
+#'
+#'   One exception to storing all metadata: `readr`'s `problems` attribute if it

Review comment:
       I don't think this paragraph is necessary.

##########
File path: r/R/schema.R
##########
@@ -50,6 +50,34 @@
 #' - `$metadata`: returns the key-value metadata as a named list.
 #'    Modify or replace by assigning in (`sch$metadata <- new_metadata`).
 #'    All list elements are coerced to string.
+#'
+#' @section Metadata:
+#'
+#'   Attributes from the `data.frame` are saved alongside tables so that the
+#'   object can be reconstructed faithfully in R (e.g. with `as.data.frame()`).
+#'   This metadata can be both at the top-level of the `data.frame` (e.g.
+#'   `attributes(df)`) or at the column (e.g. `attributes(df$col_a)`) or element
+#'   level (e.g. `attributes(df[1, "col_a"])`). For example, this allows for
+#'   storing `haven` columns in a table and being able to faithfully re-create
+#'   them when pulled back into R. This metadata is separate from the schema
+#'   (e.g. types of the columns) which is compatible with other Arrow clients.
+#'   The R metadata is only read by R and is ignored by other clients (e.g.
+#'   pyarrow which has its own custom metadata for things like Pandas metadata).
+#'   This metadata is stored (and can be accessed with) `table$metadata$r`.
+#'
+#'   This metadata is saved by serializing R's attribute list structure to a
+#'   serialized string. Because of this, large amounts of metadata can quickly
+#'   increase the size of tables (and therefore the size of tables written to
+#'   parquet or feather files). If the (serialized) metadata exceeds 100Kbs in
+#'   size, it is first compressed before saving. To disable this compression
+#'   (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
+#'   include large amounts of metadata) you can set the option
+#'   `arrow.compress_metadata` to `FALSE`.

Review comment:
       ```suggestion
   #'   string. If the serialized metadata exceeds 100Kbs in size, by default
   #'   it is compressed starting in version 3.0.0. To disable this compression
   #'   (e.g. for tables that are compatible with Arrow versions before 3.0.0 and
   #'   include large amounts of metadata), set the option
   #'   `arrow.compress_metadata` to `FALSE`. Files with compressed metadata
   #'   are readable by older versions of arrow, but the metadata is dropped.
   ```
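
The behavior described in this suggestion can be sketched in base R. This is not the arrow package's implementation: the helper `to_metadata_string()`, its `compress` and `threshold` arguments, and the gzip choice are assumptions made for illustration, with `threshold = 100000` standing in for the "100Kbs" limit and `compress` standing in for the `arrow.compress_metadata` option.

```r
# Hypothetical helper: turn an R object into a single string, compressing
# the serialized bytes first when they exceed `threshold`.
to_metadata_string <- function(x, compress = TRUE, threshold = 100000) {
  # ASCII serialization yields raw bytes that are safe to convert to a string.
  out <- serialize(x, connection = NULL, ascii = TRUE)
  if (compress && length(out) > threshold) {
    # Compress, then re-serialize the raw vector so the result is still text.
    compressed <- serialize(
      memCompress(out, type = "gzip"),
      connection = NULL,
      ascii = TRUE
    )
    # Keep the compressed form only if it is actually smaller.
    if (length(compressed) < length(out)) out <- compressed
  }
  rawToChar(out)
}

# Hypothetical, highly repetitive metadata that is large once serialized.
big <- list(note = rep("repetitive metadata", 20000))

plain <- to_metadata_string(big, compress = FALSE)
small <- to_metadata_string(big, compress = TRUE)

# Reading back: the uncompressed string unserializes directly; the compressed
# one unserializes to a raw vector that must be decompressed first.
restored_plain <- unserialize(charToRaw(plain))
restored_small <- unserialize(
  memDecompress(unserialize(charToRaw(small)), type = "gzip")
)
```

A reader that does not know about the compression step would get a raw vector instead of the attribute list when unserializing `small`, which is consistent with the note that older versions can read such files but drop the metadata.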



