Posted to issues@arrow.apache.org by "achapkowski (via GitHub)" <gi...@apache.org> on 2023/11/02 12:24:19 UTC

[I] Custom data types in arrow array [arrow]

achapkowski opened a new issue, #38559:
URL: https://github.com/apache/arrow/issues/38559

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello,
   
   Given a simple class like this:
   
   ```python
   import pyarrow as pa
   
   class Point:
       def __init__(self, iterable):
           self._data = iterable
   
       @property
       def data(self) -> dict:
           return self._data
   
   
   pt = Point({"x": 1, "y": 2})
   dataset = [pt]
   
   pa.chunked_array([dataset])
   ```
   
   How can I create a custom data type so that these objects can be stored in a pyarrow `Array` or `ChunkedArray`?
   
   When I try to create an array from a list of these objects, I get the following:
   
   ```pyarrow.lib.ArrowInvalid: Could not convert <Point object at ...> with type Point: did not recognize Python value type when inferring an Arrow data type```
   
   Is there a way to allow for these custom objects to exist in an arrow array?
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Custom data types in arrow array [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.
llama90 commented on issue #38559:
URL: https://github.com/apache/arrow/issues/38559#issuecomment-1790864497

   There is a documentation section on [Defining extension types](https://arrow.apache.org/docs/python/extending_types.html#defining-extension-types-user-defined-types). Looking at the example there, it seems that explicit conversion is necessary for these extension types.




Re: [I] Custom data types in arrow array [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.
llama90 commented on issue #38559:
URL: https://github.com/apache/arrow/issues/38559#issuecomment-1790681605

   Here's what I found.
   
   `pyarrow` cannot automatically infer an Arrow type for the user-defined `Point` objects in the dataset, so the conversion steps in the code below are necessary. When creating an Arrow array, the data has to be compatible with Arrow's type system.
   
   Therefore, `structured_array` is built first to convert the `Point` objects into a form that `pyarrow` understands, in this case a list of dictionaries.
   
   Then, in the call to `pa.array`, the `type` argument explicitly specifies the array's type, so the data is typed as a struct array with fields `x` and `y` of type `int64`.
   
   ```python
   import pyarrow as pa
   
   
   class Point:
       def __init__(self, iterable):
           self._data = iterable
   
       @property
       def data(self) -> dict:
           return self._data
   
   
   if __name__ == '__main__':
       pt1 = Point({"x": 1, "y": 2, })
       pt2 = Point({"x": 3, "y": 4, })
       pt3 = Point({"x": 5, "y": 6, })
       pt4 = Point({"x": 7, "y": 8, })
   
       dataset = [pt1, pt2, pt3, pt4]
       structured_array = [pt.data for pt in dataset]
       arrow_array = pa.array(structured_array, type=pa.struct([('x', pa.int64()), ('y', pa.int64())]))
       chunked_array = pa.chunked_array([arrow_array])
       print(chunked_array)
   
   ```
   
   Output:
   
   ```
   [
     -- is_valid: all not null
     -- child 0 type: int64
       [
         1,
         3,
         5,
         7
       ]
     -- child 1 type: int64
       [
         2,
         4,
         6,
         8
       ]
   ]
   ```




Re: [I] Custom data types in arrow array [arrow]

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on issue #38559:
URL: https://github.com/apache/arrow/issues/38559#issuecomment-1793312045

   @achapkowski you need to register an extension type according to the above doc. For an example of a point extension type, see https://github.com/geoarrow/geoarrow-python/pull/2/files#diff-ac5f6fc8e4244a9d670057b61cb59405d311917b8763fb1148691c11f59ce585R176-R196 (not yet merged because we're still figuring out exactly the layout of geoarrow-python)




Re: [I] Custom data types in arrow array [arrow]

Posted by "achapkowski (via GitHub)" <gi...@apache.org>.
achapkowski commented on issue #38559:
URL: https://github.com/apache/arrow/issues/38559#issuecomment-1790791163

   Is there no way to register a custom data type with Arrow? My goal was to avoid having to iterate over the dataset and convert it.




Re: [I] Custom data types in arrow array [arrow]

Posted by "mariosasko (via GitHub)" <gi...@apache.org>.
mariosasko commented on issue #38559:
URL: https://github.com/apache/arrow/issues/38559#issuecomment-1799267483

   The automatic type inference for extension types would also be useful for the HF `datasets` project, so I suggested a similar thing in https://github.com/apache/arrow/issues/35647#issuecomment-1559886902

