You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/18 22:23:37 UTC

[GitHub] [arrow] paleolimbot opened a new pull request #12467: ARROW-15471: [R] ExtensionType support in R

paleolimbot opened a new pull request #12467:
URL: https://github.com/apache/arrow/pull/12467


   This PR implements extension type support and registration in the R bindings (as has been possible in the Python bindings for some time). The details still need to be worked out, but we at least have a working pattern:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(R6)
   
   SomeExtensionTypeSubclass <- R6Class(
     "SomeExtensionTypeSubclass", inherit = arrow:::ExtensionType,
     public = list(
       some_custom_method = function() {
         private$some_custom_field
       },
       
       .Deserialize = function(storage_type, extension_name, extension_metadata) {
         private$some_custom_field <- head(extension_metadata, 5)
       }
     ),
     private = list(
       some_custom_field = NULL
     )
   )
   
   SomeExtensionArraySubclass <- R6Class(
     "SomeExtensionArraySubclass", inherit = arrow:::ExtensionArray,
     public = list(
       some_custom_method = function() {
         self$type$some_custom_method()
       }
     )
   )
   
   type <- arrow:::MakeExtensionType(
     int32(),
     "some_extension_subclass",
     charToRaw("some custom metadata"),
     type_class = SomeExtensionTypeSubclass,
     array_class = SomeExtensionArraySubclass
   )
   
   
   arrow:::RegisterExtensionType(type)
   
   # survives the C API round trip
   ptr_type <- arrow:::allocate_arrow_schema()
   type$export_to_c(ptr_type)
   type2 <- arrow:::DataType$import_from_c(ptr_type)
   
   
   type2
   #> SomeExtensionTypeSubclass
   #> SomeExtensionTypeSubclass <some custom metadata>
   type2$some_custom_method()
   #> [1] 73 6f 6d 65 20
   
   (array <- type$WrapArray(Array$create(1:10)))
   #> SomeExtensionArraySubclass
   #> <SomeExtensionTypeSubclass <some custom metadata>>
   #> [
   #>   1,
   #>   2,
   #>   3,
   #>   4,
   #>   5,
   #>   6,
   #>   7,
   #>   8,
   #>   9,
   #>   10
   #> ]
   array$some_custom_method()
   #> [1] 73 6f 6d 65 20
   
   ptr_array <- arrow:::allocate_arrow_array()
   array$export_to_c(ptr_array, ptr_type)
   (array2 <- Array$import_from_c(ptr_array, ptr_type))
   #> SomeExtensionArraySubclass
   #> <SomeExtensionTypeSubclass <some custom metadata>>
   #> [
   #>   1,
   #>   2,
   #>   3,
   #>   4,
   #>   5,
   #>   6,
   #>   7,
   #>   8,
   #>   9,
   #>   10
   #> ]
   
   arrow:::delete_arrow_schema(ptr_type)
   arrow:::delete_arrow_array(ptr_array)
   ```
   
   <sup>Created on 2022-02-18 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r838772015



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       hydrate & dehydrate are common terms I've seen for this type of operation too: https://stackoverflow.com/questions/6991135/what-does-it-mean-to-hydrate-an-object




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r831050623



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray

Review comment:
       I actually have no idea what `#' @include arrow-package.R` is doing (it was copy/pasted from the DataType documentation). Do you know what it does?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r832101209



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$.array_as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inheritx(x, "ExtensionArray") && type$Equals(x$type)) {

Review comment:
       Good catch! (Now fixed and tested!)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075362907


   I think so! This example is probably better than the example I have in there right now because the serializing/deserializing of the metadata is a big part of the picture and the current documentation example only implements the array-to-r conversion. Check to make sure it's what you meant though!
   
   I didn't implement this quite in the same way as the Python one...I think in Python the workflow (and correct me if I'm wrong) is along the lines of
   
   - Create a `ExtensionTypeSubclass(with, parameters, like, this)` instance
   - C++ calls the `__arrow_ext_serialize__` method of the Python instance when the serialized metadata is needed
   
   In R it's totally bananas to call from C++ into R and we can't do it safely most of the time. So instead I wrote it like:
   
   - Create the extension metadata before either the R or C++ instance is created
   - Create a C++ instance of `RExtensionType()` that contains the definitive copy of the serialized extension metadata
   - Create the ExtensionTypeSubclass R6 instance and then call the R6 instance's `.Deserialize()` method to populate data fields.
   
   It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I *like* how it's implemented (but I'm also not sure how to make it better).
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   QuantizedType <- R6::R6Class(
     "QuantizedType", 
     inherit = ExtensionType,
     public = list(
       center = function() private$.center,
       scale = function() private$.scale,
       
       .array_as_vector = function(extension_array) {
         as.vector(extension_array$storage() / private$.scale + private$.center)
       },
       
       .Deserialize = function(storage_type, extension_name, extension_metadata) {
         parsed <- jsonlite::fromJSON(self$extension_metadata_utf8())
         private$.center <- as.double(parsed$center)
         private$.scale <- as.double(parsed$scale)
       }
     ),
     private = list(
       .center = NULL,
       .scale = NULL
     )
   )
   
   quantized <- function(center = 0, scale = 1, storage_type = int32()) {
     new_extension_type(
       storage_type = storage_type,
       extension_name = "arrow.example.quantized",
       extension_metadata = jsonlite::toJSON(
         list(
           center = jsonlite::unbox(as.double(center)),
           scale = jsonlite::unbox(as.double(scale))
         )
       ),
       type_class = QuantizedType
     )
   }
   
   quantized_array <- function(x, center = 0, scale = 1, 
                               storage_type = int32()) {
     type <- quantized(center, scale, storage_type)
     new_extension_array(
       Array$create((x - center) * scale, type = storage_type),
       type
     )
   }
   
   reregister_extension_type(quantized())
   
   (vals <- runif(5, min = 19, max = 21))
   #> [1] 19.33526 19.47467 19.14288 20.39798 19.04523
   
   (array <- quantized_array(
     vals,
     center = 20,
     scale = 2 ^ 15 - 1,
     storage_type = int16())
   )
   #> ExtensionArray
   #> <QuantizedType <{"center":20,"scale":32767}>>
   #> [
   #>   -21781,
   #>   -17213,
   #>   -28085,
   #>   13040,
   #>   -31284
   #> ]
   
   array$type$center()
   #> [1] 20
   array$type$scale()
   #> [1] 32767
   
   as.vector(array)
   #> [1] 19.33528 19.47468 19.14289 20.39796 19.04526
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1080985575


   Sorry Joris for the late reply!
   
   > is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata?
   
   That's a really good question...it's might possible to do this the other way around, which would make it a little more intuitive. Creating the R6 instance from C++ (usually via automatically generated wrapper code) is the default...the default constructor errors if there isn't a valid one:
   
   https://github.com/apache/arrow/blob/1b796ec3f9caeb5e86e3348ba940bef8d95915c5/r/R/arrow-package.R#L310-L322
   
   The alternative would be to materialize the C++ object when it is required. I will play with that because I really don't like the dot prefix thing that I'm currently doing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r817235734



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,154 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    }
+  )
+)
+
+ExtensionArray$.default_new <- ExtensionArray$new
+ExtensionArray$new <- function(xp) {
+  superclass <- ExtensionArray$.default_new(xp)
+  registered_type <- extension_type_registry[[superclass$type$extension_name()]]
+  if (is.null(registered_type)) {
+    return(superclass)
+  }
+
+  class <- registered_type$.__enclos_env__$private$array_class
+  if (inherits(superclass, class$classname)) {
+    return(superclass)
+  }
+
+  class$new(xp)
+}
+
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$.Deserialize(
+        self$storage_type(),
+        self$extension_name(),
+        self$Serialize()
+      )
+    },
+
+    .set_r6_constructors = function(type_class, array_class) {
+      private$type_class <- type_class
+      private$array_class <- array_class
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },
+
+    MakeArray = function(data) {
+      assert_is(data, "ArrayData")
+      ExtensionType__MakeArray(self, data)
+    },
+
+    WrapArray = function(array) {
+      assert_is(array, "Array")
+      self$MakeArray(array$data())
+    },
+
+    ToString = function() {
+      metadata_utf8 <- rawToChar(self$Serialize())
+      Encoding(metadata_utf8) <- "UTF-8"
+      paste0(class(self)[1], " <", metadata_utf8, ">")
+    },
+
+    .Deserialize = function(storage_type, extension_name, extension_metadata) {
+      # Do nothing by default but allow other classes to override this method
+      # to populate R6 class members.
+    }
+  ),
+
+  private = list(
+    type_class = NULL,
+    array_class = NULL
+  )
+)
+
+ExtensionType$.default_new <- ExtensionType$new
+ExtensionType$new <- function(xp) {
+  superclass <- ExtensionType$.default_new(xp)
+  registered_type <- extension_type_registry[[superclass$extension_name()]]
+  if (is.null(registered_type)) {
+    return(superclass)
+  }
+
+  registered_type$.__enclos_env__$private$type_class$new(xp)
+}
+
+
+MakeExtensionType <- function(storage_type,
+                              extension_name, extension_metadata,

Review comment:
       Very minor
   
   ```suggestion
                                 extension_name, 
                                 extension_metadata,
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1074365289


   This is very exciting. Like I mentioned earlier, I wanted to try this out locally to see what this looks like. The example is a little contrived (and actually AFAIU, not totally accurate depending on the time of year!)
   
   Is it expected that roundtripping without the {vctrs} class wouldn't work? (Or did I do something wrong here?
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   # Is this the minimal structure to create a custom class like this?
   KoreanAge <- R6::R6Class(
     "KoreanAge", 
     inherit = ExtensionType,
     public = list(
       .array_as_vector = function(extension_array) {
         extension_array$storage()$as_vector() + 1
       }
     )
   )
   
   KoreanAge <- new_extension_type(
     int32(),
     "KoreanAge",
     charToRaw("Korean Age, but stored as the western age value"),
     type_class = KoreanAge
   )
   
   arr <- new_extension_array(c(0, 1, 2), KoreanAge)
   
   # What we expect (storage + 1)
   as.vector(arr)
   #> [1] 1 2 3
   
   # But roundtripping doesn't seem to work?
   tf <- tempfile()
   write_feather(arrow_table(col = arr), tf)
   
   tab <- read_feather(tf, as_data_frame = FALSE)
   
   type(tab$col)
   #> Int32
   #> int32
   
   as.vector(tab$col)
   #> [1] 0 1 2
   ```
   
   Also, should we export `ExtensionArray`? It doesn't look like it is, but we do have `Array` etc. exported. The docs additions are really great + descriptive. But I do wonder if an example (or two) would be nice, even if they were pretty trivial extension type like this (or some of the vctrs examples with percentages and the like).
   
   Do we have a follow on for what to do about printing the array? You'll see here you print the underlying storage type, which might be fine, but that has confused some folks before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075451977


   I haven't dug too deeply yet, but the `QuantizedType` example is _fantastic_. I really like that _ think it shows off what/how this is capable of doing in a way that the previous examples (both mine and the ones in the test) didn't quite fully get to


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075489604


   > I think so! This example is probably better
   
   That example indeed nicely shows that it is possible. 
   I suppose one thing that confused me is that `new_extension_type(..)` rather creates a new *instance* of a given type, and not a new type itself (and thus this allows you to create instances of a given extension type with different metadata).
   
   > It isn't all that straightforward to do (the way I've implemented it in R) and I'm not sure I like how it's implemented (but I'm also not sure how to make it better).
   
   I think the fact that you already determine the serialized metadata upfront in R is fine / nice (the fact that in Python this is C++ calling back into python is kind of a complication, as the metadata could be known at the point when instantiating the python extension type instance).  
   One thing I was wondering: the reason you can't define a `.Serialize` method on the QuantizedType that can hold this logic to create the serialized metadata (instead of doing that inside `new_extension_type( ..., extension_metadata=... )`), is that this class instance is only created after the underlying C++ extension type, for which you already need the serialized metadata? It's not possible to first initialize the QuantizedType object, and let that call Serialize to then create the underlying C++ object? (disclaimer: I know nothing about R classes :))
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r837739566



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },
+
+    # To make sure this conversion is done properly
+    SerializeUTF8 = function() {
+      metadata_utf8 <- rawToChar(self$Serialize())
+      Encoding(metadata_utf8) <- "UTF-8"
+      metadata_utf8
+    },
+
+    WrapArray = function(array) {
+      assert_is(array, "Array")
+      ExtensionType__MakeArray(self, array$data())
+    },
+
+    Deserialize = function() {
+      # Do nothing by default but allow other classes to override this method
+      # to populate R6 class members.
+    },
+
+    ExtensionEquals = function(other) {
+      inherits(other, "ExtensionType") &&
+        identical(other$extension_name(), self$extension_name()) &&
+        identical(other$Serialize(), self$Serialize())
+    },
+
+    as_vector = function(extension_array) {
+      if (inherits(extension_array, "ChunkedArray")) {
+        # Converting one array at a time so that users don't have to remember
+        # to implement two methods. Converting all the storage arrays to
+        # a ChunkedArray and then converting is probably faster
+        # (VctrsExtensionType does this).
+        storage_vectors <- lapply(
+          seq_len(extension_array$num_chunks) - 1L,
+          function(i) self$as_vector(extension_array$chunk(i))
+        )
+
+        vctrs::vec_c(!!! storage_vectors)
+      } else if (inherits(extension_array, "ExtensionArray")) {
+        extension_array$storage()$as_vector()
+      } else {
+        classes <- paste(class(extension_array), collapse = " / ")
+        abort(
+          c(
+            "`extension_array` must be a ChunkedArray or ExtensionArray",
+            i = glue::glue("Got object of type {classes}")
+          )
+        )
+      }
+    },
+
+    ToString = function() {
+      # metadata is probably valid UTF-8 (e.g., JSON), but might not be
+      # and it's confusing to error when printing the object. This herustic
+      # isn't perfect (but subclasses should override this method anyway)
+      metadata_raw <- self$Serialize()
+
+      if (as.raw(0x00) %in% metadata_raw) {
+        if (length(metadata_raw) > 20) {
+          sprintf(
+            "<%s %s...>",
+            class(self)[1],
+            paste(format(utils::head(metadata_raw, 20)), collapse = " ")
+          )
+        } else {
+          sprintf(
+            "<%s %s>",
+            class(self)[1],
+            paste(format(metadata_raw), collapse = " ")
+          )
+        }
+
+      } else {
+        paste0(class(self)[1], " <", self$SerializeUTF8(), ">")
+      }
+    }
+  )
+)
+
+# ExtensionType$new() is what gets used by the generated wrapper code to
+# create an R6 object when a shared_ptr<DataType> is returned to R and
+# that object has type_id() EXTENSION_TYPE. Rather than add complexity
+# to the wrapper code, we modify ExtensionType$new() to do what we need
+# it to do here (which is to return an instance of a custom R6
+# type whose .Deserialize method is called to populate custom fields).
+ExtensionType$.default_new <- ExtensionType$new
+ExtensionType$new <- function(xp) {
+  super <- ExtensionType$.default_new(xp)
+  r6_class <- super$r6_class()
+  if (identical(r6_class$classname, "ExtensionType")) {
+    super
+  } else {
+    r6_class$new(xp)
+  }
+}
+
+ExtensionType$create <- function(storage_type,
+                                 extension_name,
+                                 extension_metadata = raw(),
+                                 type_class = ExtensionType) {
+  if (is.string(extension_metadata)) {
+    extension_metadata <- charToRaw(enc2utf8(extension_metadata))
+  }
+
+  assert_that(is.string(extension_name), is.raw(extension_metadata))
+  assert_is(storage_type, "DataType")
+  assert_is(type_class, "R6ClassGenerator")
+
+  ExtensionType__initialize(
+    storage_type,
+    extension_name,
+    extension_metadata,
+    type_class
+  )
+}
+
+#' Extension types
+#'
+#' Extension arrays are wrappers around regular Arrow [Array] objects
+#' that provide some customized behaviour and/or storage. A common use-case
+#' for extension types is to define a customized conversion between an
+#' an Arrow [Array] and an R object when the default conversion is slow
+#' or looses metadata important to the interpretation of values in the array.
+#' For most types, the built-in
+#' [vctrs extension type][vctrs_extension_type] is probably sufficient.
+#'
+#' These functions create, register, and unregister [ExtensionType]
+#' and [ExtensionArray] objects. To use an extension type you will have to:
+#'
+#' - Define an [R6::R6Class] that inherits from [ExtensionType] and reimplement
+#'   one or more methods (e.g., `Deserialize()`).
+#' - Make a type constructor function (e.g., `my_extension_type()`) that calls
+#'   [new_extension_type()] to create an R6 instance that can be used as a
+#'   [data type][data-type] elsewhere in the package.
+#' - Make an array constructor function (e.g., `my_extension_array()`) that
+#'   calls [new_extension_array()] to create an [Array] instance of your
+#'   extension type.
+#' - Register a dummy instance of your extension type created using
+#'   you constructor function using [register_extension_type()].
+#'
+#' If defining an extension type in an R package, you will probably want to
+#' use [reregister_extension_type()] in that package's [.onLoad()] hook
+#' since your package will probably get reloaded in the same R session
+#' during its development and [register_extension_type()] will error if
+#' called twice for the same `extension_name`. For an example of an
+#' extension type that uses most of these features, see
+#' [vctrs_extension_type()].
+#'
+#' @param storage_type The [data type][data-type] of the underlying storage
+#'   array.
+#' @param storage_array An [Array] object of the underlying storage.
+#' @param extension_type An [ExtensionType] instance.
+#' @param extension_name The extension name. This should be namespaced using
+#'   "dot" syntax (i.e., "some_package.some_type"). The namespace "arrow"
+#'    is reserved for extension types defined by the Apache Arrow libraries.
+#' @param extension_metadata A [raw()] or [character()] vector containing the
+#'   serialized version of the type. Chatacter vectors must be length 1 and
+#'   are converted to UTF-8 before converting to [raw()].
+#' @param type_class An [R6::R6Class] whose `$new()` class method will be
+#'   used to construct a new instance of the type.
+#'
+#' @return
+#'   - `new_extension_type()` returns an [ExtensionType] instance according
+#'     to the `type_class` specified.
+#'   - `new_extension_array()` returns an [ExtensionArray] whose `$type`
+#'     corresponds to `extension_type`.
+#'   - `register_extension_type()`, `unregister_extension_type()`
+#'      and `reregister_extension_type()` return `NULL`, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_available()
+#' # Create the R6 type whose methods control how Array objects are
+#' # converted to R objects, how equality between types is computed,
+#' # and how types are printed.
+#' QuantizedType <- R6::R6Class(
+#'   "QuantizedType",
+#'   inherit = ExtensionType,
+#'   public = list(
+#'     # methods to access the custom metadata fields
+#'     center = function() private$.center,
+#'     scale = function() private$.scale,
+#'
+#'     # called when an Array of this type is converted to an R vector
+#'     as_vector = function(extension_array) {
+#'       if (inherits(extension_array, "ExtensionArray"))
+#'         unquantized_arrow <-
+#'           (extension_array$storage()$cast(float64()) / private$.scale) +
+#'           private$.center
+#'
+#'         as.vector(unquantized_arrow)
+#'       } else {
+#'         super$as_vector(extension_array)
+#'       }
+#'     },
+#'
+#'     # populate the custom metadata fields from the serialized metadata
+#'     Deserialize = function() {
+#'       vals <- as.numeric(strsplit(self$SerializeUTF8(), ";")[[1]])
+#'       private$.center <- vals[1]
+#'       private$.scale <- vals[2]
+#'     }
+#'   ),
+#'
+#'   private = list(
+#'     .center = NULL,
+#'     .scale = NULL
+#'   )
+#' )
+#'
+#' # Create a helper type constructor that calls new_extension_type()
+#' quantized <- function(center = 0, scale = 1, storage_type = int32()) {
+#'   new_extension_type(
+#'     storage_type = storage_type,
+#'     extension_name = "arrow.example.quantized",
+#'     extension_metadata = paste(center, scale, sep = ";"),
+#'     type_class = QuantizedType
+#'   )
+#' }
+#'
+#' # Create a helper array constructor that calls new_extension_array()
+#' quantized_array <- function(x, center = 0, scale = 1,
+#'                             storage_type = int32()) {
+#'   type <- quantized(center, scale, storage_type)
+#'   new_extension_array(
+#'     Array$create((x - center) * scale, type = storage_type),
+#'     type
+#'   )
+#' }
+#'
+#' # Register the extension type so that Arrow knows what to do when
+#' # it encounters this extension type
+#' reregister_extension_type(quantized())
+#'
+#' # Create Array objects and use them!
+#' (vals <- runif(5, min = 19, max = 21))
+#'
+#' (array <- quantized_array(
+#'   vals,
+#'   center = 20,
+#'   scale = 2 ^ 15 - 1,
+#'   storage_type = int16())
+#' )

Review comment:
       I'm curious what the reason is for wrapping `()` here. I've seen it used to silence output, but both of these should already have no output, right?

##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       This is minor (and we might also use this terminology like this elsewhere), but we might add something extra to this name so that it's clear that this is not serializing _the array_ but rather just the metadata information (and same for `Deserialize`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r838761611



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       Antoine suggested `populate_instance()`, although I think I like `deserialize_extension()` a bit better. I'll give it one more pass and see what feels the best.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1045274054






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r837942773



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },
+
+    # To make sure this conversion is done properly
+    SerializeUTF8 = function() {
+      metadata_utf8 <- rawToChar(self$Serialize())
+      Encoding(metadata_utf8) <- "UTF-8"
+      metadata_utf8
+    },
+
+    WrapArray = function(array) {
+      assert_is(array, "Array")
+      ExtensionType__MakeArray(self, array$data())
+    },
+
+    Deserialize = function() {
+      # Do nothing by default but allow other classes to override this method
+      # to populate R6 class members.
+    },
+
+    ExtensionEquals = function(other) {
+      inherits(other, "ExtensionType") &&
+        identical(other$extension_name(), self$extension_name()) &&
+        identical(other$Serialize(), self$Serialize())
+    },
+
+    as_vector = function(extension_array) {
+      if (inherits(extension_array, "ChunkedArray")) {
+        # Converting one array at a time so that users don't have to remember
+        # to implement two methods. Converting all the storage arrays to
+        # a ChunkedArray and then converting is probably faster
+        # (VctrsExtensionType does this).
+        storage_vectors <- lapply(
+          seq_len(extension_array$num_chunks) - 1L,
+          function(i) self$as_vector(extension_array$chunk(i))
+        )
+
+        vctrs::vec_c(!!! storage_vectors)
+      } else if (inherits(extension_array, "ExtensionArray")) {
+        extension_array$storage()$as_vector()
+      } else {
+        classes <- paste(class(extension_array), collapse = " / ")
+        abort(
+          c(
+            "`extension_array` must be a ChunkedArray or ExtensionArray",
+            i = glue::glue("Got object of type {classes}")
+          )
+        )
+      }
+    },
+
+    ToString = function() {
+      # metadata is probably valid UTF-8 (e.g., JSON), but might not be
+      # and it's confusing to error when printing the object. This herustic
+      # isn't perfect (but subclasses should override this method anyway)
+      metadata_raw <- self$Serialize()
+
+      if (as.raw(0x00) %in% metadata_raw) {
+        if (length(metadata_raw) > 20) {
+          sprintf(
+            "<%s %s...>",
+            class(self)[1],
+            paste(format(utils::head(metadata_raw, 20)), collapse = " ")
+          )
+        } else {
+          sprintf(
+            "<%s %s>",
+            class(self)[1],
+            paste(format(metadata_raw), collapse = " ")
+          )
+        }
+
+      } else {
+        paste0(class(self)[1], " <", self$SerializeUTF8(), ">")
+      }
+    }
+  )
+)
+
+# ExtensionType$new() is what gets used by the generated wrapper code to
+# create an R6 object when a shared_ptr<DataType> is returned to R and
+# that object has type_id() EXTENSION_TYPE. Rather than add complexity
+# to the wrapper code, we modify ExtensionType$new() to do what we need
+# it to do here (which is to return an instance of a custom R6
+# type whose .Deserialize method is called to populate custom fields).
+ExtensionType$.default_new <- ExtensionType$new
+ExtensionType$new <- function(xp) {
+  super <- ExtensionType$.default_new(xp)
+  r6_class <- super$r6_class()
+  if (identical(r6_class$classname, "ExtensionType")) {
+    super
+  } else {
+    r6_class$new(xp)
+  }
+}
+
+ExtensionType$create <- function(storage_type,
+                                 extension_name,
+                                 extension_metadata = raw(),
+                                 type_class = ExtensionType) {
+  if (is.string(extension_metadata)) {
+    extension_metadata <- charToRaw(enc2utf8(extension_metadata))
+  }
+
+  assert_that(is.string(extension_name), is.raw(extension_metadata))
+  assert_is(storage_type, "DataType")
+  assert_is(type_class, "R6ClassGenerator")
+
+  ExtensionType__initialize(
+    storage_type,
+    extension_name,
+    extension_metadata,
+    type_class
+  )
+}
+
+#' Extension types
+#'
+#' Extension arrays are wrappers around regular Arrow [Array] objects
+#' that provide some customized behaviour and/or storage. A common use-case
+#' for extension types is to define a customized conversion between an
+#' an Arrow [Array] and an R object when the default conversion is slow
+#' or looses metadata important to the interpretation of values in the array.
+#' For most types, the built-in
+#' [vctrs extension type][vctrs_extension_type] is probably sufficient.
+#'
+#' These functions create, register, and unregister [ExtensionType]
+#' and [ExtensionArray] objects. To use an extension type you will have to:
+#'
+#' - Define an [R6::R6Class] that inherits from [ExtensionType] and reimplement
+#'   one or more methods (e.g., `Deserialize()`).
+#' - Make a type constructor function (e.g., `my_extension_type()`) that calls
+#'   [new_extension_type()] to create an R6 instance that can be used as a
+#'   [data type][data-type] elsewhere in the package.
+#' - Make an array constructor function (e.g., `my_extension_array()`) that
+#'   calls [new_extension_array()] to create an [Array] instance of your
+#'   extension type.
+#' - Register a dummy instance of your extension type created using
+#'   you constructor function using [register_extension_type()].
+#'
+#' If defining an extension type in an R package, you will probably want to
+#' use [reregister_extension_type()] in that package's [.onLoad()] hook
+#' since your package will probably get reloaded in the same R session
+#' during its development and [register_extension_type()] will error if
+#' called twice for the same `extension_name`. For an example of an
+#' extension type that uses most of these features, see
+#' [vctrs_extension_type()].
+#'
+#' @param storage_type The [data type][data-type] of the underlying storage
+#'   array.
+#' @param storage_array An [Array] object of the underlying storage.
+#' @param extension_type An [ExtensionType] instance.
+#' @param extension_name The extension name. This should be namespaced using
+#'   "dot" syntax (i.e., "some_package.some_type"). The namespace "arrow"
+#'    is reserved for extension types defined by the Apache Arrow libraries.
+#' @param extension_metadata A [raw()] or [character()] vector containing the
+#'   serialized version of the type. Chatacter vectors must be length 1 and
+#'   are converted to UTF-8 before converting to [raw()].
+#' @param type_class An [R6::R6Class] whose `$new()` class method will be
+#'   used to construct a new instance of the type.
+#'
+#' @return
+#'   - `new_extension_type()` returns an [ExtensionType] instance according
+#'     to the `type_class` specified.
+#'   - `new_extension_array()` returns an [ExtensionArray] whose `$type`
+#'     corresponds to `extension_type`.
+#'   - `register_extension_type()`, `unregister_extension_type()`
+#'      and `reregister_extension_type()` return `NULL`, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_available()
+#' # Create the R6 type whose methods control how Array objects are
+#' # converted to R objects, how equality between types is computed,
+#' # and how types are printed.
+#' QuantizedType <- R6::R6Class(
+#'   "QuantizedType",
+#'   inherit = ExtensionType,
+#'   public = list(
+#'     # methods to access the custom metadata fields
+#'     center = function() private$.center,
+#'     scale = function() private$.scale,
+#'
+#'     # called when an Array of this type is converted to an R vector
+#'     as_vector = function(extension_array) {
+#'       if (inherits(extension_array, "ExtensionArray"))
+#'         unquantized_arrow <-
+#'           (extension_array$storage()$cast(float64()) / private$.scale) +
+#'           private$.center
+#'
+#'         as.vector(unquantized_arrow)
+#'       } else {
+#'         super$as_vector(extension_array)
+#'       }
+#'     },
+#'
+#'     # populate the custom metadata fields from the serialized metadata
+#'     Deserialize = function() {
+#'       vals <- as.numeric(strsplit(self$SerializeUTF8(), ";")[[1]])
+#'       private$.center <- vals[1]
+#'       private$.scale <- vals[2]
+#'     }
+#'   ),
+#'
+#'   private = list(
+#'     .center = NULL,
+#'     .scale = NULL
+#'   )
+#' )
+#'
+#' # Create a helper type constructor that calls new_extension_type()
+#' quantized <- function(center = 0, scale = 1, storage_type = int32()) {
+#'   new_extension_type(
+#'     storage_type = storage_type,
+#'     extension_name = "arrow.example.quantized",
+#'     extension_metadata = paste(center, scale, sep = ";"),
+#'     type_class = QuantizedType
+#'   )
+#' }
+#'
+#' # Create a helper array constructor that calls new_extension_array()
+#' quantized_array <- function(x, center = 0, scale = 1,
+#'                             storage_type = int32()) {
+#'   type <- quantized(center, scale, storage_type)
+#'   new_extension_array(
+#'     Array$create((x - center) * scale, type = storage_type),
+#'     type
+#'   )
+#' }
+#'
+#' # Register the extension type so that Arrow knows what to do when
+#' # it encounters this extension type
+#' reregister_extension_type(quantized())
+#'
+#' # Create Array objects and use them!
+#' (vals <- runif(5, min = 19, max = 21))
+#'
+#' (array <- quantized_array(
+#'   vals,
+#'   center = 20,
+#'   scale = 2 ^ 15 - 1,
+#'   storage_type = int16())
+#' )

Review comment:
       Oh, oh right it's the opposite of silencing 🤦. We can keep it in — in the actual examples on the pkgdown site it'll be obvious what's going on.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r837943014



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       Yeah, those names read much more naturally to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r838514284



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       Any thoughts on the best name for `Deserialize()`? In Python this is `__arrow_ext_deserialize__()` (although does something slightly different)...maybe `restore_extension()`? `restore()`? It also could be left out (subclassers would be then forced to override `initialize(xp)`, which might be more intuitive if the subclasser knows anything about R6).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1082282379


   I messaged Joris and Romain today asking for reviews...it's getting close but I think will benefit from their take!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r837824168



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       I stole these names from the C++, but we can rename them to something else (an earlier version of this PR used `extension_metadata()` and `extension_metadata_utf8()`; the Python version also uses some alternative names so there's certainly precedent).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r838771333



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       I went with `deserialize_instance()` (we *are* deserializing, but maybe more clear that it's specific to this instance and not creating a new one?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r829146168



##########
File path: r/R/arrow-tabular.R
##########
@@ -98,6 +98,40 @@ ArrowTabular <- R6Class("ArrowTabular",
   )
 )
 
+tabular_as_data_frame_common <- function(x, base) {

Review comment:
       I'm worried about this implementation because it's unintuitive...this gets used by `Table$to_data_frame()` and `RecordBatch$to_data_frame()` because both of those call into C++ to do their thing (but the C++ implementation doesn't know about extension types. Maybe it should?). Pretty much everwhere else we avoid looping over columns in R but that might be better than added complexity at the C++ level?

##########
File path: r/src/extension.cpp
##########
@@ -0,0 +1,246 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "./arrow_types.h"
+
+#if defined(ARROW_R_WITH_ARROW)
+
+#include <thread>
+
+#include <arrow/array.h>
+#include <arrow/extension_type.h>
+#include <arrow/type.h>
+
+// A wrapper around arrow::ExtensionType that allows R to register extension
+// types whose Deserialize, ExtensionEquals, and Serialize methods are
+// in meanintfully handled at the R level. At the C++ level, the type is
+// already serialized to minimize calls to R from C++.
+//
+// Using a std::shared_ptr<> to wrap a cpp11::sexp type is unusual, but we
+// need it here to avoid calling the copy constructor from another thread,
+// since this might call into the R API. If we don't do this, we get crashes
+// when reading a multi-file Dataset.

Review comment:
       It's true that I need to use `std::shared_ptr<cpp11::environment>` to store a the `r6_class_` field here instead of `cpp11::environment` to avoid a crash, but I'm not entirely sure I'm using `std::shared_ptr` correctly.

##########
File path: r/src/extension.cpp
##########
@@ -0,0 +1,246 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "./arrow_types.h"
+
+#if defined(ARROW_R_WITH_ARROW)
+
+#include <thread>
+
+#include <arrow/array.h>
+#include <arrow/extension_type.h>
+#include <arrow/type.h>
+
+// A wrapper around arrow::ExtensionType that allows R to register extension
+// types whose Deserialize, ExtensionEquals, and Serialize methods are
+// in meanintfully handled at the R level. At the C++ level, the type is
+// already serialized to minimize calls to R from C++.
+//
+// Using a std::shared_ptr<> to wrap a cpp11::sexp type is unusual, but we
+// need it here to avoid calling the copy constructor from another thread,
+// since this might call into the R API. If we don't do this, we get crashes
+// when reading a multi-file Dataset.
+class RExtensionType : public arrow::ExtensionType {
+ public:
+  RExtensionType(const std::shared_ptr<arrow::DataType> storage_type,
+                 std::string extension_name, std::string extension_metadata,
+                 std::shared_ptr<cpp11::environment> r6_class,
+                 std::thread::id creation_thread)
+      : arrow::ExtensionType(storage_type),
+        extension_name_(extension_name),
+        extension_metadata_(extension_metadata),
+        r6_class_(r6_class),
+        creation_thread_(creation_thread) {}
+
+  std::string extension_name() const { return extension_name_; }
+
+  bool ExtensionEquals(const arrow::ExtensionType& other) const;
+
+  std::shared_ptr<arrow::Array> MakeArray(std::shared_ptr<arrow::ArrayData> data) const;
+
+  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
+      std::shared_ptr<arrow::DataType> storage_type,
+      const std::string& serialized_data) const;
+
+  std::string Serialize() const { return extension_metadata_; }
+
+  std::string ToString() const;
+
+  std::unique_ptr<RExtensionType> Clone() const;
+
+  cpp11::environment r6_class() const { return *r6_class_; }
+
+  cpp11::environment r6_instance(std::shared_ptr<arrow::DataType> storage_type,
+                                 const std::string& serialized_data) const;
+
+  cpp11::environment r6_instance() const {
+    return r6_instance(storage_type(), Serialize());
+  }
+
+ private:
+  std::string extension_name_;
+  std::string extension_metadata_;
+  std::string cached_to_string_;
+  std::shared_ptr<cpp11::environment> r6_class_;
+  std::thread::id creation_thread_;
+
+  arrow::Status assert_r_thread() const {
+    if (std::this_thread::get_id() == creation_thread_) {
+      return arrow::Status::OK();
+    } else {
+      return arrow::Status::ExecutionError("RExtensionType <", extension_name_,
+                                           "> attempted to call into R ",
+                                           "from a non-R thread");
+    }
+  }
+};
+
+bool RExtensionType::ExtensionEquals(const arrow::ExtensionType& other) const {
+  // Avoid materializing the R6 instance if at all possible, since this is slow
+  // and in some cases not possible due to threading
+  if (other.extension_name() != extension_name()) {
+    return false;
+  }
+
+  if (other.Serialize() == Serialize()) {
+    return true;
+  }
+
+  // With any ambiguity, we need to materialize the R6 instance and call its
+  // ExtensionEquals method. We can't do this on the non-R thread.
+  arrow::Status is_r_thread = assert_r_thread();
+  if (!assert_r_thread().ok()) {
+    throw std::runtime_error(is_r_thread.message());
+  }
+
+  cpp11::environment instance = r6_instance();
+  cpp11::function instance_ExtensionEquals(instance[".ExtensionEquals"]);
+
+  std::shared_ptr<DataType> other_shared =
+      ValueOrStop(other.Deserialize(other.storage_type(), other.Serialize()));
+  cpp11::sexp other_r6 = cpp11::to_r6<DataType>(other_shared, "ExtensionType");
+
+  cpp11::logicals result(instance_ExtensionEquals(other_r6));
+  return cpp11::as_cpp<bool>(result);
+}
+
+std::shared_ptr<arrow::Array> RExtensionType::MakeArray(
+    std::shared_ptr<arrow::ArrayData> data) const {
+  std::shared_ptr<arrow::ArrayData> new_data = data->Copy();
+  std::unique_ptr<RExtensionType> cloned = Clone();
+  new_data->type = std::shared_ptr<RExtensionType>(cloned.release());
+  return std::make_shared<arrow::ExtensionArray>(new_data);
+}
+
+arrow::Result<std::shared_ptr<arrow::DataType>> RExtensionType::Deserialize(
+    std::shared_ptr<arrow::DataType> storage_type,
+    const std::string& serialized_data) const {
+  std::unique_ptr<RExtensionType> cloned = Clone();
+  cloned->storage_type_ = storage_type;
+  cloned->extension_metadata_ = serialized_data;
+
+  // We probably should create an ephemeral R6 instance here, which will call
+  // the R6 instance's .Deserialize() method, possibly erroring when the metadata is
+  // invalid or the deserialized values are invalid. When there is an error it will be
+  // confusing, since it will only occur when the result surfaces to R
+  // (which might be much later). Unfortunately, the Deserialize() method gets
+  // called from other threads frequently (e.g., when reading a multi-file Dataset),
+  // and we get crashes if we try this. As a compromise, we call this method when we can
+  // to maximize the likelihood an error is surfaced.
+  if (assert_r_thread().ok()) {
+    cloned->r6_instance();
+  }

Review comment:
       This is the main threading concern...the `Deserialize()` method gets called from other threads frequently but unless it's been passed through an R6 instance in R, we don't know if the metadata is valid or not.

##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$.array_as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inheritx(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension metadata
+#'   as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$.Deserialize(storage_type, extension_name, extension_metadata)`
+#'   This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized `extension_metadata` (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$.array_as_vector(extension_array)`: Convert an [Array] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects or when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()]. The default method returns the converted
+#'   storage array.
+#' - `$.chunked_array_as_vector(chunked_array)`: Convert a [ChunkedArray]
+#'   to an R vector. This method is called by [as.vector()] on a [ChunkedArray]
+#'   whose type matches this extension type or when a [Table] containing
+#'   such a column is converted to a [data.frame()]. The default method
+#'   returns the converted version of the equivalent storage arrays
+#'   as a [ChunkedArray].
+#' - `$.ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call .Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$.Deserialize(
+        self$storage_type(),
+        self$extension_name(),
+        self$Serialize()
+      )
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },
+
+    MakeArray = function(data) {
+      assert_is(data, "ArrayData")
+      ExtensionType__MakeArray(self, data)
+    },
+
+    WrapArray = function(array) {
+      assert_is(array, "Array")
+      self$MakeArray(array$data())
+    },
+

Review comment:
       I made up the "dot prefix means protected method" thing here...I don't know if there is a convention for "protected"-style methods in R6 but would be happy to use it if it exists.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r831053629



##########
File path: r/src/extension.cpp
##########
@@ -0,0 +1,246 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "./arrow_types.h"
+
+#if defined(ARROW_R_WITH_ARROW)
+
+#include <thread>
+
+#include <arrow/array.h>
+#include <arrow/extension_type.h>
+#include <arrow/type.h>
+
+// A wrapper around arrow::ExtensionType that allows R to register extension
+// types whose Deserialize, ExtensionEquals, and Serialize methods are
+// in meanintfully handled at the R level. At the C++ level, the type is
+// already serialized to minimize calls to R from C++.
+//
+// Using a std::shared_ptr<> to wrap a cpp11::sexp type is unusual, but we
+// need it here to avoid calling the copy constructor from another thread,
+// since this might call into the R API. If we don't do this, we get crashes
+// when reading a multi-file Dataset.
+class RExtensionType : public arrow::ExtensionType {
+ public:
+  RExtensionType(const std::shared_ptr<arrow::DataType> storage_type,
+                 std::string extension_name, std::string extension_metadata,
+                 std::shared_ptr<cpp11::environment> r6_class,
+                 std::thread::id creation_thread)
+      : arrow::ExtensionType(storage_type),
+        extension_name_(extension_name),
+        extension_metadata_(extension_metadata),
+        r6_class_(r6_class),
+        creation_thread_(creation_thread) {}
+
+  std::string extension_name() const { return extension_name_; }
+
+  bool ExtensionEquals(const arrow::ExtensionType& other) const;
+
+  std::shared_ptr<arrow::Array> MakeArray(std::shared_ptr<arrow::ArrayData> data) const;
+
+  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
+      std::shared_ptr<arrow::DataType> storage_type,
+      const std::string& serialized_data) const;
+
+  std::string Serialize() const { return extension_metadata_; }
+
+  std::string ToString() const;
+
+  std::unique_ptr<RExtensionType> Clone() const;
+
+  cpp11::environment r6_class() const { return *r6_class_; }
+
+  cpp11::environment r6_instance(std::shared_ptr<arrow::DataType> storage_type,
+                                 const std::string& serialized_data) const;
+
+  cpp11::environment r6_instance() const {
+    return r6_instance(storage_type(), Serialize());
+  }
+
+ private:
+  std::string extension_name_;
+  std::string extension_metadata_;
+  std::string cached_to_string_;
+  std::shared_ptr<cpp11::environment> r6_class_;
+  std::thread::id creation_thread_;
+
+  arrow::Status assert_r_thread() const {
+    if (std::this_thread::get_id() == creation_thread_) {
+      return arrow::Status::OK();
+    } else {
+      return arrow::Status::ExecutionError("RExtensionType <", extension_name_,
+                                           "> attempted to call into R ",
+                                           "from a non-R thread");
+    }
+  }
+};
+
+bool RExtensionType::ExtensionEquals(const arrow::ExtensionType& other) const {
+  // Avoid materializing the R6 instance if at all possible, since this is slow
+  // and in some cases not possible due to threading
+  if (other.extension_name() != extension_name()) {
+    return false;
+  }
+
+  if (other.Serialize() == Serialize()) {
+    return true;
+  }
+
+  // With any ambiguity, we need to materialize the R6 instance and call its
+  // ExtensionEquals method. We can't do this on the non-R thread.
+  arrow::Status is_r_thread = assert_r_thread();
+  if (!assert_r_thread().ok()) {
+    throw std::runtime_error(is_r_thread.message());
+  }

Review comment:
       This is another threading concern...the data types can't be checked for equality if the serialized data is not identical and the comparison occurs on another thread. I think this might happen when reading a multi-file dataset if some of the files were written differently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r831470356



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$.array_as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inheritx(x, "ExtensionArray") && type$Equals(x$type)) {

Review comment:
       ```suggestion
     if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
   ```
   
   This looks like a typo? Though I'm surprised it got through — though maybe there aren't tests for creating the extension array itself?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075158797


   Aaaah, yeah I totally missed `register_extension_type()` oops!
   
   Though weirdly(?) the `as.vector()` here is still wrong, it should be `1 2 3` (since the ->R conversion on this Extension type should have 1 added to it)
   
   ```
   # you need to register the type for Arrow C++ to keep the extension type
   # slash metadata when it encounters it at the C++ level (import from C
   # and reading files)
   register_extension_type(korean_age())
   
   tf <- tempfile()
   write_feather(arrow_table(col = arr), tf)
   
   tab <- read_feather(tf, as_data_frame = FALSE)
   
   type(tab$col)
   #> KoreanAge
   #> KoreanAge <Korean Age, but stored as the western age value>
   
   as.vector(tab$col)
   #> [1] 0 1 2
   ```
   
   > I should probably export ExtensionArray and use ExtensionArray$create() rather than new_extension_array() since it's more arrow-ish to do that. 
   
   In other places we've exposed both, which I think isn't bad here (it's slightly more API we manage, but having the R6 stuff exposed makes it easier to extend, and having the `foo_bar_baz` looks nicer | is a bit more friendly to users|doesn't force someone to learn all the R6 if what they want or need is available there)...
   
   > Maybe ExtensionType$create() instead of new_extension_type() is where extension type creation should go, too.
   
   Hmm, if we think it's ok to do the `new_extension_type()` inside of `ExtensionType$create()` I would agree that having a second step like that is unnecessary. Though with the proper docs having the R6 way of establishing these need to be slightly more verbose and you need to manage other things is inline with our other implementations of things like that (e.g. creating one's own filesystem, etc.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r832121328



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray

Review comment:
       At the top of the file makes sense to me so I'll do it that way!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot edited a comment on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot edited a comment on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1081896525


   A few more modifications:
   
   - I moved conversion of extension types to R objects into the Converter API and removed the modifications to ChunkedArray, Table, and RecordBatch (Romain suggested this). This feels a lot better and is more likely to "just work" in more places.
   - I implemented the geoarrow side of this to make sure it will work. It does (except for a bit in one of the compute kernels where Concatenate doesn't work for extension types)! See details and https://github.com/paleolimbot/geoarrow/pull/7
   - I did play with reversing the order of instantiation of the C++ and R6 objects...I think that change is a kind of a big one with respect to how all objects get passed around in our current implementation. I did, however, rename the methods to match the C++ method names and it feels a lot better now.
   
   <details>
   
   ``` r
   # remotes::install_github("apache/arrow#12467")
   # remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   library(geoarrow)
   
   places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
   places <- open_dataset(places_folder)
   places$schema$geometry$type
   #> GeoArrowType
   #> point GEOGCS["WGS 84",DATUM["WGS_...
   places$schema$geometry$type$crs
   #> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"
   
   # works!
   Scanner$create(places)$ToTable()
   #> Table
   #> 7255 rows x 6 columns
   #> $osm_id <string>
   #> $code <int32>
   #> $population <double>
   #> $name <string>
   #> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
   #> $fclass <string>
   #> 
   #> See $metadata for additional Schema metadata
   
   # works!
   as.data.frame(Scanner$create(places)$ToTable())
   #> # A tibble: 7,255 × 6
   #>    osm_id      code population name           geometry                    fclass
   #>    <chr>      <int>      <dbl> <chr>          <wk_wkb>                    <chr> 
   #>  1 21040334    1001      50781 Roskilde       <POINT (12.08192 55.64335)> city  
   #>  2 21040360    1001      72398 Esbjerg        <POINT (8.452075 55.46649)> city  
   #>  3 26559154    1001      62687 Randers        <POINT (10.03715 56.46175)> city  
   #>  4 26559170    1001      60508 Kolding        <POINT (9.47905 55.4895)>   city  
   #>  5 26559198    1001      56567 Vejle          <POINT (9.533324 55.70001)> city  
   #>  6 26559213    1001     273077 Aarhus         <POINT (10.2134 56.14963)>  city  
   #>  7 26559274    1001     178210 Odense         <POINT (10.38521 55.39972)> city  
   #>  8 1368129781  1001      58646 Horsens        <POINT (9.844477 55.86117)> city  
   #>  9 2247730880  1001     114194 Aalborg        <POINT (9.921526 57.04626)> city  
   #> 10 393558713   1030          0 Englebjerggård <POINT (11.77737 55.2004)>  farm  
   #> # … with 7,245 more rows
   
   # unfortunately, this fails...
   places %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     collect()
   #> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
   #> ! NotImplemented: concatenation of extension<geoarrow.point>
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(&out_data)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375  impl_->DoFinish()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
   
   # ...unless we unregister the extension type and use geoarrow_collect()
   arrow::unregister_extension_type("geoarrow.point")
   open_dataset(places_folder) %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     geoarrow_collect()
   #> # A tibble: 5 × 4
   #>   name          population fclass           geometry                   
   #>   <chr>              <dbl> <chr>            <wk_wkb>                   
   #> 1 København         613288 national_capital <POINT (12.57007 55.68672)>
   #> 2 Aarhus            273077 city             <POINT (10.2134 56.14963)> 
   #> 3 Odense            178210 city             <POINT (10.38521 55.39972)>
   #> 4 Aalborg           114194 city             <POINT (9.921526 57.04626)>
   #> 5 Frederiksberg     102029 suburb           <POINT (12.53262 55.67802)>
   ```
   
   <sup>Created on 2022-03-29 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1081896525


   ...and I implemented the geoarrow side of this to make sure it will work. It does (except for a bit in one of the compute kernels where Concatenate doesn't work for extension types). See details and https://github.com/paleolimbot/geoarrow/pull/7
   
   <details>
   
   ``` r
   # remotes::install_github("apache/arrow#12467")
   # remotes::install_github("paleolimbot/geoarrow@arrow-ext-type")
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   library(geoarrow)
   
   places_folder <- system.file("example_dataset/osm_places", package = "geoarrow")
   places <- open_dataset(places_folder)
   places$schema$geometry$type
   #> GeoArrowType
   #> point GEOGCS["WGS 84",DATUM["WGS_...
   places$schema$geometry$type$crs
   #> [1] "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],AXIS[\"Latitude\",NORTH]]"
   
   # works!
   Scanner$create(places)$ToTable()
   #> Table
   #> 7255 rows x 6 columns
   #> $osm_id <string>
   #> $code <int32>
   #> $population <double>
   #> $name <string>
   #> $geometry <point GEOGCS["WGS 84",DATUM["WGS_...>
   #> $fclass <string>
   #> 
   #> See $metadata for additional Schema metadata
   
   # works!
   as.data.frame(Scanner$create(places)$ToTable())
   #> # A tibble: 7,255 × 6
   #>    osm_id      code population name           geometry                    fclass
   #>    <chr>      <int>      <dbl> <chr>          <wk_wkb>                    <chr> 
   #>  1 21040334    1001      50781 Roskilde       <POINT (12.08192 55.64335)> city  
   #>  2 21040360    1001      72398 Esbjerg        <POINT (8.452075 55.46649)> city  
   #>  3 26559154    1001      62687 Randers        <POINT (10.03715 56.46175)> city  
   #>  4 26559170    1001      60508 Kolding        <POINT (9.47905 55.4895)>   city  
   #>  5 26559198    1001      56567 Vejle          <POINT (9.533324 55.70001)> city  
   #>  6 26559213    1001     273077 Aarhus         <POINT (10.2134 56.14963)>  city  
   #>  7 26559274    1001     178210 Odense         <POINT (10.38521 55.39972)> city  
   #>  8 1368129781  1001      58646 Horsens        <POINT (9.844477 55.86117)> city  
   #>  9 2247730880  1001     114194 Aalborg        <POINT (9.921526 57.04626)> city  
   #> 10 393558713   1030          0 Englebjerggård <POINT (11.77737 55.2004)>  farm  
   #> # … with 7,245 more rows
   
   # unfortunately, this fails...
   places %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     collect()
   #> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
   #> ! NotImplemented: concatenation of extension<geoarrow.point>
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:195  VisitTypeInline(*out_->type, this)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/array/concatenate.cc:590  ConcatenateImpl(data, pool).Concatenate(&out_data)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2025  Concatenate(values.chunks(), ctx->memory_pool())
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/kernels/vector_selection.cc:2084  TakeCA(*table.column(j), indices, options, ctx)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/sink_node.cc:375  impl_->DoFinish()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
   #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
   
   # ...unless we unregister the extension type and use geoarrow_collect()
   arrow::unregister_extension_type("geoarrow.point")
   open_dataset(places_folder) %>% 
     filter(population > 100000) %>% 
     select(name, population, fclass, geometry) %>% 
     arrange(desc(population)) %>% 
     geoarrow_collect()
   #> # A tibble: 5 × 4
   #>   name          population fclass           geometry                   
   #>   <chr>              <dbl> <chr>            <wk_wkb>                   
   #> 1 København         613288 national_capital <POINT (12.57007 55.68672)>
   #> 2 Aarhus            273077 city             <POINT (10.2134 56.14963)> 
   #> 3 Odense            178210 city             <POINT (10.38521 55.39972)>
   #> 4 Aalborg           114194 city             <POINT (9.921526 57.04626)>
   #> 5 Frederiksberg     102029 suburb           <POINT (12.53262 55.67802)>
   ```
   
   <sup>Created on 2022-03-29 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r837821333



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },
+
+    # To make sure this conversion is done properly
+    SerializeUTF8 = function() {
+      metadata_utf8 <- rawToChar(self$Serialize())
+      Encoding(metadata_utf8) <- "UTF-8"
+      metadata_utf8
+    },
+
+    WrapArray = function(array) {
+      assert_is(array, "Array")
+      ExtensionType__MakeArray(self, array$data())
+    },
+
+    Deserialize = function() {
+      # Do nothing by default but allow other classes to override this method
+      # to populate R6 class members.
+    },
+
+    ExtensionEquals = function(other) {
+      inherits(other, "ExtensionType") &&
+        identical(other$extension_name(), self$extension_name()) &&
+        identical(other$Serialize(), self$Serialize())
+    },
+
+    as_vector = function(extension_array) {
+      if (inherits(extension_array, "ChunkedArray")) {
+        # Converting one array at a time so that users don't have to remember
+        # to implement two methods. Converting all the storage arrays to
+        # a ChunkedArray and then converting is probably faster
+        # (VctrsExtensionType does this).
+        storage_vectors <- lapply(
+          seq_len(extension_array$num_chunks) - 1L,
+          function(i) self$as_vector(extension_array$chunk(i))
+        )
+
+        vctrs::vec_c(!!! storage_vectors)
+      } else if (inherits(extension_array, "ExtensionArray")) {
+        extension_array$storage()$as_vector()
+      } else {
+        classes <- paste(class(extension_array), collapse = " / ")
+        abort(
+          c(
+            "`extension_array` must be a ChunkedArray or ExtensionArray",
+            i = glue::glue("Got object of type {classes}")
+          )
+        )
+      }
+    },
+
+    ToString = function() {
+      # metadata is probably valid UTF-8 (e.g., JSON), but might not be
+      # and it's confusing to error when printing the object. This herustic
+      # isn't perfect (but subclasses should override this method anyway)
+      metadata_raw <- self$Serialize()
+
+      if (as.raw(0x00) %in% metadata_raw) {
+        if (length(metadata_raw) > 20) {
+          sprintf(
+            "<%s %s...>",
+            class(self)[1],
+            paste(format(utils::head(metadata_raw, 20)), collapse = " ")
+          )
+        } else {
+          sprintf(
+            "<%s %s>",
+            class(self)[1],
+            paste(format(metadata_raw), collapse = " ")
+          )
+        }
+
+      } else {
+        paste0(class(self)[1], " <", self$SerializeUTF8(), ">")
+      }
+    }
+  )
+)
+
+# ExtensionType$new() is what gets used by the generated wrapper code to
+# create an R6 object when a shared_ptr<DataType> is returned to R and
+# that object has type_id() EXTENSION_TYPE. Rather than add complexity
+# to the wrapper code, we modify ExtensionType$new() to do what we need
+# it to do here (which is to return an instance of a custom R6
+# type whose .Deserialize method is called to populate custom fields).
+ExtensionType$.default_new <- ExtensionType$new
+ExtensionType$new <- function(xp) {
+  super <- ExtensionType$.default_new(xp)
+  r6_class <- super$r6_class()
+  if (identical(r6_class$classname, "ExtensionType")) {
+    super
+  } else {
+    r6_class$new(xp)
+  }
+}
+
+ExtensionType$create <- function(storage_type,
+                                 extension_name,
+                                 extension_metadata = raw(),
+                                 type_class = ExtensionType) {
+  if (is.string(extension_metadata)) {
+    extension_metadata <- charToRaw(enc2utf8(extension_metadata))
+  }
+
+  assert_that(is.string(extension_name), is.raw(extension_metadata))
+  assert_is(storage_type, "DataType")
+  assert_is(type_class, "R6ClassGenerator")
+
+  ExtensionType__initialize(
+    storage_type,
+    extension_name,
+    extension_metadata,
+    type_class
+  )
+}
+
+#' Extension types
+#'
+#' Extension arrays are wrappers around regular Arrow [Array] objects
+#' that provide some customized behaviour and/or storage. A common use-case
+#' for extension types is to define a customized conversion between an
+#' an Arrow [Array] and an R object when the default conversion is slow
+#' or looses metadata important to the interpretation of values in the array.
+#' For most types, the built-in
+#' [vctrs extension type][vctrs_extension_type] is probably sufficient.
+#'
+#' These functions create, register, and unregister [ExtensionType]
+#' and [ExtensionArray] objects. To use an extension type you will have to:
+#'
+#' - Define an [R6::R6Class] that inherits from [ExtensionType] and reimplement
+#'   one or more methods (e.g., `Deserialize()`).
+#' - Make a type constructor function (e.g., `my_extension_type()`) that calls
+#'   [new_extension_type()] to create an R6 instance that can be used as a
+#'   [data type][data-type] elsewhere in the package.
+#' - Make an array constructor function (e.g., `my_extension_array()`) that
+#'   calls [new_extension_array()] to create an [Array] instance of your
+#'   extension type.
+#' - Register a dummy instance of your extension type created using
+#'   you constructor function using [register_extension_type()].
+#'
+#' If defining an extension type in an R package, you will probably want to
+#' use [reregister_extension_type()] in that package's [.onLoad()] hook
+#' since your package will probably get reloaded in the same R session
+#' during its development and [register_extension_type()] will error if
+#' called twice for the same `extension_name`. For an example of an
+#' extension type that uses most of these features, see
+#' [vctrs_extension_type()].
+#'
+#' @param storage_type The [data type][data-type] of the underlying storage
+#'   array.
+#' @param storage_array An [Array] object of the underlying storage.
+#' @param extension_type An [ExtensionType] instance.
+#' @param extension_name The extension name. This should be namespaced using
+#'   "dot" syntax (i.e., "some_package.some_type"). The namespace "arrow"
+#'    is reserved for extension types defined by the Apache Arrow libraries.
+#' @param extension_metadata A [raw()] or [character()] vector containing the
+#'   serialized version of the type. Chatacter vectors must be length 1 and
+#'   are converted to UTF-8 before converting to [raw()].
+#' @param type_class An [R6::R6Class] whose `$new()` class method will be
+#'   used to construct a new instance of the type.
+#'
+#' @return
+#'   - `new_extension_type()` returns an [ExtensionType] instance according
+#'     to the `type_class` specified.
+#'   - `new_extension_array()` returns an [ExtensionArray] whose `$type`
+#'     corresponds to `extension_type`.
+#'   - `register_extension_type()`, `unregister_extension_type()`
+#'      and `reregister_extension_type()` return `NULL`, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_available()
+#' # Create the R6 type whose methods control how Array objects are
+#' # converted to R objects, how equality between types is computed,
+#' # and how types are printed.
+#' QuantizedType <- R6::R6Class(
+#'   "QuantizedType",
+#'   inherit = ExtensionType,
+#'   public = list(
+#'     # methods to access the custom metadata fields
+#'     center = function() private$.center,
+#'     scale = function() private$.scale,
+#'
+#'     # called when an Array of this type is converted to an R vector
+#'     as_vector = function(extension_array) {
+#'       if (inherits(extension_array, "ExtensionArray"))
+#'         unquantized_arrow <-
+#'           (extension_array$storage()$cast(float64()) / private$.scale) +
+#'           private$.center
+#'
+#'         as.vector(unquantized_arrow)
+#'       } else {
+#'         super$as_vector(extension_array)
+#'       }
+#'     },
+#'
+#'     # populate the custom metadata fields from the serialized metadata
+#'     Deserialize = function() {
+#'       vals <- as.numeric(strsplit(self$SerializeUTF8(), ";")[[1]])
+#'       private$.center <- vals[1]
+#'       private$.scale <- vals[2]
+#'     }
+#'   ),
+#'
+#'   private = list(
+#'     .center = NULL,
+#'     .scale = NULL
+#'   )
+#' )
+#'
+#' # Create a helper type constructor that calls new_extension_type()
+#' quantized <- function(center = 0, scale = 1, storage_type = int32()) {
+#'   new_extension_type(
+#'     storage_type = storage_type,
+#'     extension_name = "arrow.example.quantized",
+#'     extension_metadata = paste(center, scale, sep = ";"),
+#'     type_class = QuantizedType
+#'   )
+#' }
+#'
+#' # Create a helper array constructor that calls new_extension_array()
+#' quantized_array <- function(x, center = 0, scale = 1,
+#'                             storage_type = int32()) {
+#'   type <- quantized(center, scale, storage_type)
+#'   new_extension_array(
+#'     Array$create((x - center) * scale, type = storage_type),
+#'     type
+#'   )
+#' }
+#'
+#' # Register the extension type so that Arrow knows what to do when
+#' # it encounters this extension type
+#' reregister_extension_type(quantized())
+#'
+#' # Create Array objects and use them!
+#' (vals <- runif(5, min = 19, max = 21))
+#'
+#' (array <- quantized_array(
+#'   vals,
+#'   center = 20,
+#'   scale = 2 ^ 15 - 1,
+#'   storage_type = int16())
+#' )

Review comment:
       I did it to explicitly print the output (that was more useful in the initial version, which was a reprex, than it is here in the example, although the printing part does affect the pkgdown output). Maybe best to remove it if it's confusing?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r838753065



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,543 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+
+
+#' @title class arrow::ExtensionArray
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionArray` class inherits from `Array`, but also provides
+#' access to the underlying storage of the extension.
+#'
+#' - `$storage()`: Returns the underlying [Array] used to store
+#'   values.
+#'
+#' The `ExtensionArray` is not intended to be subclassed for extension
+#' types.
+#'
+#' @rdname ExtensionArray
+#' @name ExtensionArray
+#' @export
+ExtensionArray <- R6Class("ExtensionArray",
+  inherit = Array,
+  public = list(
+    storage = function() {
+      ExtensionArray__storage(self)
+    },
+
+    as_vector = function() {
+      self$type$as_vector(self)
+    }
+  )
+)
+
+ExtensionArray$create <- function(x, type) {
+  assert_is(type, "ExtensionType")
+  if (inherits(x, "ExtensionArray") && type$Equals(x$type)) {
+    return(x)
+  }
+
+  storage <- Array$create(x, type = type$storage_type())
+  type$WrapArray(storage)
+}
+
+#' @title class arrow::ExtensionType
+#'
+#' @usage NULL
+#' @format NULL
+#' @docType class
+#'
+#' @section Methods:
+#'
+#' The `ExtensionType` class inherits from `DataType`, but also defines
+#' extra methods specific to extension types:
+#'
+#' - `$storage_type()`: Returns the underlying [DataType] used to store
+#'   values.
+#' - `$storage_id()`: Returns the [Type] identifier corresponding to the
+#'   `$storage_type()`.
+#' - `$extension_name()`: Returns the extension name.
+#' - `$Serialize()`: Returns the serialized version of the extension
+#'   metadata as a [raw()] vector.
+#' - `$WrapArray(array)`: Wraps a storage [Array] into an [ExtensionArray]
+#'   with this extension type.
+#'
+#' In addition, subclasses may override the following methos to customize
+#' the behaviour of extension classes.
+#'
+#' - `$Deserialize()`: This method is called when a new [ExtensionType]
+#'   is initialized and is responsible for parsing and validating
+#'   the serialized extension_metadata (a [raw()] vector)
+#'   such that its contents can be inspected by fields and/or methods
+#'   of the R6 ExtensionType subclass. Implementations must also check the
+#'   `storage_type` to make sure it is compatible with the extension type.
+#' - `$as_vector(extension_array)`: Convert an [Array] or [ChunkedArray] to an R
+#'   vector. This method is called by [as.vector()] on [ExtensionArray]
+#'   objects, when a [RecordBatch] containing an [ExtensionArray] is
+#'   converted to a [data.frame()], or when a [ChunkedArray] (e.g., a column
+#'   in a [Table]) is converted to an R vector. The default method returns the
+#'   converted storage array.
+#' - `$ToString()` Return a string representation that will be printed
+#'   to the console when this type or an Array of this type is printed.
+#'
+#' @rdname ExtensionType
+#' @name ExtensionType
+#' @export
+ExtensionType <- R6Class("ExtensionType",
+  inherit = DataType,
+  public = list(
+
+    # In addition to the initialization that occurs for all
+    # ArrowObject instances, we call Deserialize(), which can
+    # be overridden to populate custom fields
+    initialize = function(xp) {
+      super$initialize(xp)
+      self$Deserialize()
+    },
+
+    # Because of how C++ shared_ptr<> objects are converted to R objects,
+    # the initial object that is instantiated will be of this class
+    # (ExtensionType), but the R6Class object that was registered is
+    # available from C++. We need this in order to produce the correct
+    # R6 subclass when a shared_ptr<ExtensionType> is returned to R.
+    r6_class = function() {
+      ExtensionType__r6_class(self)
+    },
+
+    storage_type = function() {
+      ExtensionType__storage_type(self)
+    },
+
+    storage_id = function() {
+      self$storage_type()$id
+    },
+
+    extension_name = function() {
+      ExtensionType__extension_name(self)
+    },
+
+    Serialize = function() {
+      ExtensionType__Serialize(self)
+    },

Review comment:
       `deserialize_extension` seems better than plain `Deserialize` (to me it reads more as being about the extension type itself and not the data...) Or maybe `deserialize_ext_metadata`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075482290


   I like it too! I like that it isn't R specific, that it needs parameterization, and that transforming it back into an R vector needs a calculation. I didn't do any explaining in the docs about why quantized types are cool...I've used them in raster GIS stuff because they enable really efficient storage (in exchange for some precision loss).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r830264810



##########
File path: r/R/arrow-package.R
##########
@@ -95,6 +100,7 @@
   }
 }
 
+

Review comment:
       ```suggestion
   ```
   
   Nit pick: undo whitespace change

##########
File path: r/R/arrow-tabular.R
##########
@@ -98,6 +98,40 @@ ArrowTabular <- R6Class("ArrowTabular",
   )
 )
 
+tabular_as_data_frame_common <- function(x, base) {

Review comment:
       `base` here is a function that is either `Table__to_dataframe ` or `RecordBatch__to_dataframe` yeah? Basically the constructor to be used if this isn't an extension type? 
   
   It might be nice to have a slightly more descriptive name for that?
   

##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray

Review comment:
       ```suggestion
   #' @include arrow-package.R
   
   #' @title class arrow::ExtensionArray
   ```
   
   These should be separate, yeah?

##########
File path: r/R/arrow-tabular.R
##########
@@ -98,6 +98,40 @@ ArrowTabular <- R6Class("ArrowTabular",
   )
 )
 
+tabular_as_data_frame_common <- function(x, base) {
+  x_cols <- names(x)
+  col_is_extension <- vapply(
+    x_cols,
+    function(col) inherits(x$schema[[col]]$type, "ExtensionType"),
+    logical(1)
+  )
+
+  if (!any(col_is_extension)) {

Review comment:
       ```suggestion
     # If no columns are ExtensionTypes, we use our standard constructor
     if (!any(col_is_extension)) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075121124


   The key step that was missing for the roundtrip was `register_extension_type()`, which is needed so that Arrow C++ knows not to discard the extension metadata when it encounters the type! (see details).
   
   I should probably export `ExtensionArray` and use `ExtensionArray$create()` rather than `new_extension_array()` since it's more arrow-ish to do that. Maybe `ExtensionType$create()` instead of `new_extension_type()` is where extension type creation should go, too.
   
   Printing is a good point...definitely confusing in the case of an extension type!
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   KoreanAge <- R6::R6Class(
     "KoreanAge", 
     inherit = ExtensionType,
     public = list(
       .array_as_vector = function(extension_array) {
         extension_array$storage()$as_vector() + 1
       }
     )
   )
   
   # constructor helpers
   korean_age <- function() {
     new_extension_type(
       int32(),
       "KoreanAge",
       charToRaw("Korean Age, but stored as the western age value"),
       type_class = KoreanAge
     )
   }
   
   korean_age_array <- function(age_korean) {
     new_extension_array(age_korean - 1, korean_age())
   }
   
   (arr <- korean_age_array(1:3))
   #> ExtensionArray
   #> <KoreanAge <Korean Age, but stored as the western age value>>
   #> [
   #>   0,
   #>   1,
   #>   2
   #> ]
   as.vector(arr)
   #> [1] 1 2 3
   
   # you need to register the type for Arrow C++ to keep the extension type
   # slash metadata when it encounters it at the C++ level (import from C
   # and reading files)
   register_extension_type(korean_age())
   
   tf <- tempfile()
   write_feather(arrow_table(col = arr), tf)
   
   tab <- read_feather(tf, as_data_frame = FALSE)
   
   type(tab$col)
   #> KoreanAge
   #> KoreanAge <Korean Age, but stored as the western age value>
   
   as.vector(tab$col)
   #> [1] 0 1 2
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075190581


   Ah yes, you *clearly* should have remembered to implement `KoreanAge$.chunked_array_as_vector()` (...I'll implement a better default method for that one...)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#issuecomment-1075211801


   > Printing is a good point...definitely confusing in the case of an extension type!
   
   I think we have to punt on the printing...there isn't a way to customize how other Array objects are printed (it all goes through `ChunkedArray::ToString()`/`Array::ToString()` at the C++ level and that is very specifically a `PrettyPrint()` C++ thing. We have some tools available in R to improve the printing of all ArrowTabular/ChunkedArray/Arrays, and maybe a PR with that as its scope would be more appropriate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jonkeane commented on a change in pull request #12467: ARROW-15471: [R] ExtensionType support in R

Posted by GitBox <gi...@apache.org>.
jonkeane commented on a change in pull request #12467:
URL: https://github.com/apache/arrow/pull/12467#discussion_r831463758



##########
File path: r/R/extension.R
##########
@@ -0,0 +1,437 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include arrow-package.R
+#' @title class arrow::ExtensionArray

Review comment:
       It changes the collation order to make sure that this file is marked as being after arrow-package.R. 
   
   IME, I've always put them up at the top of this file (since they are only file level...), but it looks like we have places elsewhere where we add them in the function roxygen chunks like this, so we should probably follow that style?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org