You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/06/30 19:05:09 UTC

[GitHub] [arrow-rs] wjones127 opened a new issue, #4472: Add an ExtensionType to DataType enum

wjones127 opened a new issue, #4472:
URL: https://github.com/apache/arrow-rs/issues/4472

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Extension types are annotated in field metadata. This works well with record batches, but when exporting/importing an array over the C data interface, the extension type metadata is lost.
   
   The C++ implementation solves this by having an ExtensionType class and always exports that metadata over C data interface here:
   
   https://github.com/apache/arrow/blob/b9aec9ad2b655817b8925462e4e2dd6973807e23/cpp/src/arrow/c/bridge.cc#L243-L252
   
   **Describe the solution you'd like**
   
   I'd propose adding a new enum variant to DataType:
   
   ```rust
   struct ExtensionType {
      name: Box<str>,
      metadata: Box<str>,
      storage_type: Box<DataType>,
   }
   
   enum DataType {
       ...
       ExtensionType(ExtensionType)
   }
   ```
   
   Then make sure the C data interface implementation handles exporting and importing this type.
   
   
   **Describe alternatives you've considered**
   
   We could add an extension type registry like C++, but that seems heavier than we really need.
   
   **Additional context**
   
   https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays
   
   Previous discussions:
   
    * #2444
    * #218
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1779870321

   > are you proposing something like the following?
   
   No I'm proposing not making changes to DataType and using the Field metadata


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1807180541

   > This feels like a limitation of the way the python conversion is implemented
   
   For Python conversion specifically, this might be solved with the new PyCapsule interface (ref #5067) because the `ArrowSchema` FFI struct is generated by pyarrow itself, and so it doesn't have to be inferred from `array.type`. (I haven't verified how it works with extension arrays yet)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] yukkit commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "yukkit (via GitHub)" <gi...@apache.org>.
yukkit commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1698685348

   I encountered the same trouble when trying to add a custom type. Although the extension type can be marked through the metadata of FIeld, the metadata will be lost in the scene of array processing, for example: udf in datafusion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1811436993

   FYI @yukkit  has created a PR showing how `LogicalType`s might work in DataFusion: https://github.com/apache/arrow-datafusion/pull/8143. It is a pretty neat idea. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wjones127 commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1615258370

   > I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?
   
   This works well for RecordBatch, but not for an individual array transported independent from any batch. Basically, arrays themselves have no way to be tagged as an extension array, since those don't contain a field where that metadata is stored; they are only extension arrays in the context of a batch.
   
   > I feel quite strongly that only codepaths explicitly concerned with extension types should need concern themselves with them, for example the take or arithmetic kernel should not need to know about extension types.
   
   I definitely agree, and don't want to make these operations more complex than they ought to be. 
   
   If we can think of another place to put this information, I'm open to that.
   
   (A bit of a tangent, but...) In my ideal world, there would be a logical type enum and a physical type enum. Physical types would be the current `DataType`. Then logical types would be things like `String` (just one, regardless of offset size and encoding) and then a generic `ExtensionType` variant. Sort of like what Sasha was talking about a long time ago: https://lists.apache.org/thread/357z4587dczho4x1257ttf0b4o9302co


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1779887158

   > No I'm proposing not making changes to DataType and using the Field metadata that already exists. This way we avoid conflating physical and logical type information. 
   
   So how would we implement @kylebarron 's use case? Perhaps via a RecordBatch (with a single column)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1780625868

   RecordBatch/Schema is but one way that users might choose to expose logical type information, they might also define their own array abstractions that wrap arrow arrays, or their own schema abstraction at plan time, etc...
   
   As @kylebarron rightly states it's a zero-sum API challenge, either all of arrow must become aware of extension types, or it is confined to the areas that actually care. It seems odd to me to optimize the design here for things that are not present in the specification at the expense of everything else, further it seems unfortunate to optimise for one particular way of encoding logical type information.
   
   I don't have all the answers here, I don't know what a general purpose logical type abstraction looks like, if such a thing even exists, but it does seem that the core library shouldn't be opinionated in this regard


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1779764006

   @yukkit  is contemplating User Defined Types in DataFusion, and the arrow extension type mechanism is the obvious implementation I think -- see  https://github.com/apache/arrow-datafusion/issues/7923
   
   > I personally think Field is the correct location for such metadata,
   
   @tustvold are you proposing something like the following? 
   
   ```rust
   enum DataType { 
   ....
      List(FieldRef),
      /// Extension type, with potentially embedded metadata in the field reference
      Extension(FieldRef)
   }
   ```
   
   This proposal runs afoul of how [`DataType::List`](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.List) works today (where the field name is mostly irrelevant ("item")), but I don't really have any better ideas. 
   
   I think this structure would allow @kylebarron  to implement
   
   ```rust
   impl TryFrom<&dyn Array> for GeometryArray {
     fn try_from(arr: &dyn Array>) -> Result<Self> {
       match arr.data_type() {
         DataType::Extension(field) if is_geo_type(field.metadata()) => {
           .... do the conversion ...
         }
         dt => Err("Unsupported datatype: {dt}
      }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1749805039

   I'm taking a stab at migrating my [`geoarrow-rs` crate](https://github.com/geoarrow/geoarrow-rs) (which implements the [GeoArrow extension array spec](https://geoarrow.org/)) from arrow2 to arrow-rs, and wanted to add that I'm feeling the omission of a `DataType::Extension` variant in arrow-rs.
   
   In particular, a geospatial algorithm would have to return a `Field` with _every operation_, because the physical layout of a `LineStringArray` is exactly the same as that of a `MultiPointArray` (and `PolygonArray`/`MultiLineStringArray`). Maybe this is nitpicking, but it I've liked the level of abstraction of having the extension metadata on the `DataType`, because the operations on the array are separate from a named column in a table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1615757110

   > If we can think of another place to put this information, I'm open to that.
   
   I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level. Whilst it isn't a true logical vs physical separation, I think that is a helpful way to conceptualize it.
   
   > But it would be nice to find a way to not have to do that.
   
   This feels like a limitation of the way the python conversion is implemented
   
   ```
   impl ToPyArrow for ArrayData {
       fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
           let array = FFI_ArrowArray::new(self);
           let schema = FFI_ArrowSchema::try_from(self.data_type()).map_err(to_py_err)?;
   
           let module = py.import("pyarrow")?;
           let class = module.getattr("Array")?;
           let array = class.call_method1(
               "_import_from_c",
               (
                   addr_of!(array) as Py_uintptr_t,
                   addr_of!(schema) as Py_uintptr_t,
               ),
           )?;
           Ok(array.to_object(py))
       }
   }
   ```
   
   In particular the schema is inferred from the array's data type. If instead there were a way to provide a `Field` then this would allow propagating not only extension metadata, but also nullability, dictionary ordering, etc... It would also potentially allow performing the schema conversion once and using the result for multiple arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wjones127 commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1631210105

   > I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level.
   
   I think that could be a decent approach, although I'm still trying to understand what that would look like. It sounds like the arrow-rs type system is closed, but can be wrapped in a higher-level type system. (whereas, the C++ kernels are extension-aware).
   
   So it sounds like the point to add extension types is when building extensions in datafusion. Eventually, I think it would be nice to have an example in arrow-datafusion showing how to add support for a simple extension type (such as UUID) in the engine. Basically the end result would be showing that 
   
   ```sql
   SELECT gen_random_uuid()
   ```
   
   outputs
   
   ```
   +--------------------------------------+
   | gen_random_uuid()                    |
   +--------------------------------------+
   | eeccb8c5-9943-b2bb-bb5e-222f4e14b687 |
   +--------------------------------------+
   
   ```
   
   Showing that you can add methods that output extension types (`gen_random_uuid()`) and that you can control how those extension types are displayed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1780298531

   @tustvold  do you think `RecordBatch` is something that end-users should be seeing? Or do you imagine this should be a hidden implementation detail in all cases?
   
   If it's the latter, I think I can understand the position to keep extension type separate. But if it's the former, it's hard to see how we can provide a decent UX without bringing the extension type into the array itself. For example, if we return a RecordBatches with a UUID, a user might reasonable surprised that the UUID column prints as the raw bytes and not a hyphenated UUID string. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1615194168

   > but when exporting/importing an array over the C data interface, the extension type metadata is lost
   
   I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?
   
   > I'd propose adding a new enum variant to DataType:
   
   My major objection to this approach is that it undermines the transparency of extension types. In particular array downcasting logic must now be updated to "see through" DataType::Extension, and arrays must be updated to be able to propagate this information correctly. I would much prefer an approach that does not require this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wjones127 commented on issue #4472: Add an ExtensionType to DataType enum

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1615260253

   FWIW my workaround for now is to just wrap it in a record batch and unwrap on the other side. But it would be nice to find a way to not have to do that.
   
   https://github.com/lancedb/lance/blob/3e1ed67acf7d1de336b6d211647d9581c64c3bed/python/src/arrow.rs#L53-L67


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1779914639

   You would need whatever performs the kernel selection to have the Field, most likely via Schema, I'm not sure you necessarily need this information afterwards?
   
   For example, the DF PhysicalExpr could have already extracted the necessary metadata at plan time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Add an ExtensionType to DataType enum [arrow-rs]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #4472:
URL: https://github.com/apache/arrow-rs/issues/4472#issuecomment-1779894768

   My interpretation of this is that it's a "zero-sum" architecture decision, in the sense that if you don't want to conflate logical and physical types in the `DataType` enum, then there's intentionally no way to implement `From` on `&dyn Array`; instead it's only possible to implement it on `(&dyn Array, &FieldRef)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org