Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/02 10:59:18 UTC

[GitHub] [arrow] jorisvandenbossche opened a new issue, #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

jorisvandenbossche opened a new issue, #33997:
URL: https://github.com/apache/arrow/issues/33997

   When wrapping a type (or array) in a pyarrow object, we need to define which Python class to use. Currently, for extension types, this logic lives here in `pyarrow_wrap_data_type`:
   
   https://github.com/apache/arrow/blob/b413ac4f2b6911af5e8241803277caccc43aa3c4/python/pyarrow/public-api.pxi#L114-L120
   
   So there are currently two options:
   
   - The ExtensionType is implemented in Python by subclassing `pyarrow.(Py)ExtensionType`, which links to the C++ `arrow::py::PyExtensionType` (a subclass of `arrow::ExtensionType`). In this case, we store the Python type instance on the C++ instance and return it as the Python object in `pyarrow_wrap_data_type`.
   - The ExtensionType is implemented in C++, in which case we currently always fall back to wrapping it in the `pyarrow.BaseExtensionType` base class (there is currently a bug in this, but that is getting fixed in [GH-33802](https://github.com/apache/arrow/pull/33802)).
   
   However, that means that for extension types implemented in C++, there is currently no way to have a "richer" Python Type object (or Array object, since that is determined by the Type, and a BaseExtensionType will always use the base ExtensionArray), even though for an extension type you might want to add type-specific attributes or methods.
   
   For canonical extension types that are implemented in Arrow C++ itself (for example, the currently discussed Tensor extension type in https://github.com/apache/arrow/pull/8510, or a previous effort to add a complex type as an extension type in https://github.com/apache/arrow/pull/10565), I think it would work today to create a custom subclass of `pyarrow.BaseExtensionType` for the specific canonical type. We could then add a special case to `pyarrow_wrap_data_type` that checks the name of the extension type and, if it is a canonical one we implement ourselves, uses the Python subclass we provide for it.
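
A minimal pure-Python sketch of the wrapping decision described above, including the proposed name-based special case. It does not use pyarrow: every class here is a stand-in, and the extension name `example.tensor`, the `TensorType` wrapper, and the `CANONICAL_WRAPPERS` table are hypothetical names chosen only for illustration.

```python
# Pure-Python stand-ins; none of these classes are pyarrow's real ones.
class CppExtensionType:
    """Models a C++ arrow::ExtensionType identified by its extension name."""

    def __init__(self, extension_name):
        self.extension_name = extension_name


class CppPyExtensionType(CppExtensionType):
    """Models arrow::py::PyExtensionType, which stores the Python type instance."""

    def __init__(self, extension_name, py_instance):
        super().__init__(extension_name)
        self.py_instance = py_instance


class BaseExtensionType:
    """Models the generic Python wrapper used as the fallback."""

    def __init__(self, cpp_type):
        self.cpp_type = cpp_type


class TensorType(BaseExtensionType):
    """Hypothetical richer wrapper for one canonical type."""

    def describe(self):
        return "tensor wrapper for " + self.cpp_type.extension_name


# Hypothetical table of canonical extension names that get a specific wrapper.
CANONICAL_WRAPPERS = {"example.tensor": TensorType}


def wrap_data_type(cpp_type):
    # Case 1: the type was defined in Python, so return the stored instance.
    if isinstance(cpp_type, CppPyExtensionType):
        return cpp_type.py_instance
    # Case 2: a canonical C++ type with a hand-written Python subclass.
    # Case 3: anything else falls back to the generic base wrapper.
    wrapper_cls = CANONICAL_WRAPPERS.get(cpp_type.extension_name, BaseExtensionType)
    return wrapper_cls(cpp_type)
```

With this shape, an externally implemented C++ type still lands in the fallback branch, which is the limitation the rest of the issue is about.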
   
   But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work. 
   I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sjperkins commented on issue #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on issue #33997:
URL: https://github.com/apache/arrow/issues/33997#issuecomment-1438928419

   > But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
   > I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for
   
   https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/python/pyarrow/types.pxi#L1095-L1100
   
   Perhaps one way to do this would be to modify `pyarrow.register_extension_type` to check for collision with an existing C++ Extension when calling `RegisterPyExtensionType`. If the name and storage type match, then the associated Python type can be registered in `_python_extension_types_registry`.
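
A rough sketch of that registration check, written in plain Python rather than against pyarrow internals. The two dictionaries only stand in for the C++ registry and for `_python_extension_types_registry`; the extension names, the storage type strings, and the boolean return value are assumptions made for illustration.

```python
# Stand-in for the C++ extension registry: extension name -> storage type.
# Both the name and the storage string are hypothetical placeholders.
cpp_extension_registry = {"example.tensor": "fixed_size_list<float>"}

# Stand-in for pyarrow's `_python_extension_types_registry`.
python_extension_types_registry = {}


class PyTensorType:
    """Hypothetical Python class meant to wrap the C++ 'example.tensor' type."""

    extension_name = "example.tensor"
    storage_type = "fixed_size_list<float>"


def register_extension_type(py_type):
    """Register py_type, attaching it to a matching C++ type if one exists."""
    name = py_type.extension_name
    cpp_storage = cpp_extension_registry.get(name)
    if cpp_storage is not None and cpp_storage != py_type.storage_type:
        # Same name but different storage: refuse rather than guess.
        raise ValueError(f"storage type mismatch for extension {name!r}")
    python_extension_types_registry[name] = py_type
    # True when the Python class now wraps an existing C++ extension type.
    return cpp_storage is not None
```

Here a name collision is only treated as an error when the storage types disagree; a matching name and storage type attaches the Python class to the existing C++ type instead.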
   




[GitHub] [arrow] rok commented on issue #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #33997:
URL: https://github.com/apache/arrow/issues/33997#issuecomment-1732437377

   I'm looking to subclass `FixedShapeTensorType` (which is difficult as it's a parametric extension type without `__init__`). @sjperkins did you resolve this in some way?




[GitHub] [arrow] sjperkins commented on issue #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on issue #33997:
URL: https://github.com/apache/arrow/issues/33997#issuecomment-1734912989

   @rok I didn't get too far with this.
   
   - It's been a while, but https://github.com/apache/arrow/pull/34483 dynamically created Python types (Extension, Array, Scalar) associated with the C++ extensions. The idea was that Python developers could then build Python APIs on top of these dynamic types. However, creating dynamic types was considered a bit too exotic.
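
For readers unfamiliar with the idea, dynamic class creation in Python can be sketched with the built-in `type()`. This is not the code from that pull request: the base classes below are plain stand-ins, and the helper `make_dynamic_types` and the extension name used with it are hypothetical.

```python
# Plain stand-ins for the three generic wrapper base classes.
class BaseExtensionType:
    pass


class ExtensionArray:
    pass


class ExtensionScalar:
    pass


def make_dynamic_types(extension_name):
    """Create a (Type, Array, Scalar) class triple named after the extension."""
    # Turn e.g. "example.tensor" into the CamelCase stem "ExampleTensor".
    parts = extension_name.replace(".", "_").split("_")
    stem = "".join(part.capitalize() for part in parts)
    bases = (
        ("Type", BaseExtensionType),
        ("Array", ExtensionArray),
        ("Scalar", ExtensionScalar),
    )
    # type(name, bases, namespace) builds a new class at runtime.
    return tuple(
        type(stem + suffix, (base,), {"extension_name": extension_name})
        for suffix, base in bases
    )
```

A developer could then subclass the generated classes to add type-specific methods, which is the kind of layering the comment describes.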
   
   
   




[GitHub] [arrow] rok commented on issue #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

Posted by "rok (via GitHub)" <gi...@apache.org>.
rok commented on issue #33997:
URL: https://github.com/apache/arrow/issues/33997#issuecomment-1735151486

   If I'm reading the discussion (in #34483 and other threads) correctly, the exotic part is the automatic generation of types, as opposed to explicit manual definition for canonical types only? Otherwise the proposed approach would be suitable, right?
   Would that work for your use case?




[GitHub] [arrow] sjperkins commented on issue #33997: [Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++

Posted by "sjperkins (via GitHub)" <gi...@apache.org>.
sjperkins commented on issue #33997:
URL: https://github.com/apache/arrow/issues/33997#issuecomment-1737518311

   > If I'm reading the discussion (in #34483 and other threads) correctly the exotic part is automatic generation of types over explicit manual definition for canonical types only? Otherwise the proposed approach would be suitable, right? Would that work for your usecase?
   
   Yes, I think it would.

