Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/01 09:32:09 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33948: GH-33947: [Python][Docs] Tensor canonical type extension example

jorisvandenbossche commented on code in PR #33948:
URL: https://github.com/apache/arrow/pull/33948#discussion_r1092960115


##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1079,3 +1082,279 @@ def test_array_constructor_from_pandas():
         pd.Series([1, 2, 3], dtype="category"), type=IntegerType()
     )
     assert result.equals(expected)
+
+
+class TensorType(pa.ExtensionType):
+    """
+    Canonical extension type class for fixed shape tensors.
+
+    Parameters
+    ----------
+    value_type : DataType or Field
+        The data type of an individual tensor
+    shape : tuple
+        shape of the tensors
+    is_row_major : bool
+        boolean indicating the order of elements
+        in memory
+
+    Examples
+    --------
+    >>> import pyarrow as pa
+    >>> tensor_type = TensorType(pa.int32(), (2, 2), 'C')
+    >>> tensor_type
+    TensorType(FixedSizeListType(fixed_size_list<item: int32>[4]))
+    >>> pa.register_extension_type(tensor_type)
+    """
+
+    def __init__(self, value_type, shape, is_row_major):
+        self._value_type = value_type
+        self._shape = shape
+        self._is_row_major = is_row_major
+        size = math.prod(shape)
+        pa.ExtensionType.__init__(self, pa.list_(self._value_type, size),
+                                  'arrow.fixed_size_tensor')
+
+    @property
+    def dtype(self):

Review Comment:
   ```suggestion
       def value_type(self):
   ```
   
   (to match the `value_type` parameter)



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1079,3 +1082,279 @@ def test_array_constructor_from_pandas():
         pd.Series([1, 2, 3], dtype="category"), type=IntegerType()
     )
     assert result.equals(expected)
+
+
+class TensorType(pa.ExtensionType):

Review Comment:
   Use `FixedShapeTensorType` here, to follow the name change?
   
   And the same for `TensorArray`.



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1079,3 +1082,279 @@ def test_array_constructor_from_pandas():
         pd.Series([1, 2, 3], dtype="category"), type=IntegerType()
     )
     assert result.equals(expected)
+
+
+class TensorType(pa.ExtensionType):
+    """
+    Canonical extension type class for fixed shape tensors.
+
+    Parameters
+    ----------
+    value_type : DataType or Field
+        The data type of an individual tensor
+    shape : tuple
+        shape of the tensors
+    is_row_major : bool
+        boolean indicating the order of elements
+        in memory
+
+    Examples
+    --------
+    >>> import pyarrow as pa
+    >>> tensor_type = TensorType(pa.int32(), (2, 2), 'C')
+    >>> tensor_type
+    TensorType(FixedSizeListType(fixed_size_list<item: int32>[4]))
+    >>> pa.register_extension_type(tensor_type)
+    """
+
+    def __init__(self, value_type, shape, is_row_major):
+        self._value_type = value_type
+        self._shape = shape
+        self._is_row_major = is_row_major
+        size = math.prod(shape)
+        pa.ExtensionType.__init__(self, pa.list_(self._value_type, size),
+                                  'arrow.fixed_size_tensor')
+
+    @property
+    def dtype(self):
+        """
+        Data type of an individual tensor.
+        """
+        return self._value_type
+
+    @property
+    def shape(self):
+        """
+        Shape of the tensors.
+        """
+        return self._shape
+
+    @property
+    def is_row_major(self):
+        """
+        Boolean indicating the order of elements in memory.
+        """
+        return self._is_row_major
+
+    def __arrow_ext_serialize__(self):
+        metadata = {"shape": str(self._shape),
+                    "is_row_major": self._is_row_major}
+        return json.dumps(metadata).encode()
+
+    @classmethod
+    def __arrow_ext_deserialize__(cls, storage_type, serialized):
+        # return an instance of this subclass given the serialized
+        # metadata.
+        assert serialized.decode().startswith('{"shape":')
+        metadata = json.loads(serialized.decode())
+        shape = ast.literal_eval(metadata['shape'])
+        order = metadata["is_row_major"]
+
+        return TensorType(storage_type.value_type, shape, order)
+
+    def __arrow_ext_class__(self):
+        return TensorArray
+
+
+class TensorArray(pa.ExtensionArray):
+    """
+    Canonical extension array class for fixed shape tensors.
+
+    Examples
+    --------
+    Define and register extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = TensorType(pa.int32(), (2, 2), 'C')
+    >>> pa.register_extension_type(tensor_type)
+
+    Create an extension array
+
+    >>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
+    >>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
+    >>> pa.ExtensionArray.from_storage(tensor_type, storage)
+    <__main__.TensorArray object at 0x1491a5a00>
+    [
+      [
+        1,
+        2,
+        3,
+        4
+      ],
+      [
+        10,
+        20,
+        30,
+        40
+      ],
+      [
+        100,
+        200,
+        300,
+        400
+      ]
+    ]
+    """
+
+    def to_numpy_tensor_list(self):

Review Comment:
   It might be more useful to showcase conversion to/from a single NumPy array with one extra leading dimension (ndim + 1), rather than a list of arrays?



##########
python/pyarrow/tests/test_extension_type.py:
##########
@@ -1079,3 +1082,279 @@ def test_array_constructor_from_pandas():
         pd.Series([1, 2, 3], dtype="category"), type=IntegerType()
     )
     assert result.equals(expected)
+
+
+class TensorType(pa.ExtensionType):
+    """
+    Canonical extension type class for fixed shape tensors.
+
+    Parameters
+    ----------
+    value_type : DataType or Field
+        The data type of an individual tensor
+    shape : tuple
+        shape of the tensors
+    is_row_major : bool
+        boolean indicating the order of elements
+        in memory
+
+    Examples
+    --------
+    >>> import pyarrow as pa
+    >>> tensor_type = TensorType(pa.int32(), (2, 2), 'C')
+    >>> tensor_type
+    TensorType(FixedSizeListType(fixed_size_list<item: int32>[4]))
+    >>> pa.register_extension_type(tensor_type)
+    """
+
+    def __init__(self, value_type, shape, is_row_major):
+        self._value_type = value_type
+        self._shape = shape
+        self._is_row_major = is_row_major
+        size = math.prod(shape)
+        pa.ExtensionType.__init__(self, pa.list_(self._value_type, size),
+                                  'arrow.fixed_size_tensor')
+
+    @property
+    def dtype(self):
+        """
+        Data type of an individual tensor.
+        """
+        return self._value_type
+
+    @property
+    def shape(self):
+        """
+        Shape of the tensors.
+        """
+        return self._shape
+
+    @property
+    def is_row_major(self):
+        """
+        Boolean indicating the order of elements in memory.
+        """
+        return self._is_row_major
+
+    def __arrow_ext_serialize__(self):
+        metadata = {"shape": str(self._shape),
+                    "is_row_major": self._is_row_major}
+        return json.dumps(metadata).encode()
+
+    @classmethod
+    def __arrow_ext_deserialize__(cls, storage_type, serialized):
+        # return an instance of this subclass given the serialized
+        # metadata.
+        assert serialized.decode().startswith('{"shape":')
+        metadata = json.loads(serialized.decode())
+        shape = ast.literal_eval(metadata['shape'])
+        order = metadata["is_row_major"]
+
+        return TensorType(storage_type.value_type, shape, order)
+
+    def __arrow_ext_class__(self):
+        return TensorArray
+
+
+class TensorArray(pa.ExtensionArray):
+    """
+    Canonical extension array class for fixed shape tensors.
+
+    Examples
+    --------
+    Define and register extension type for tensor array
+
+    >>> import pyarrow as pa
+    >>> tensor_type = TensorType(pa.int32(), (2, 2), 'C')
+    >>> pa.register_extension_type(tensor_type)
+
+    Create an extension array
+
+    >>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
+    >>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
+    >>> pa.ExtensionArray.from_storage(tensor_type, storage)
+    <__main__.TensorArray object at 0x1491a5a00>
+    [
+      [
+        1,
+        2,
+        3,
+        4
+      ],
+      [
+        10,
+        20,
+        30,
+        40
+      ],
+      [
+        100,
+        200,
+        300,
+        400
+      ]
+    ]
+    """
+
+    def to_numpy_tensor_list(self):
+        """
+        Convert tensor extension array to a list of numpy tensors (ndarrays).
+        """
+        tensors = []
+        for tensor in self.storage:
+            np_flat = np.array(tensor.as_py())
+            order = 'C' if self.type.is_row_major else 'F'
+            numpy_tensor = np_flat.reshape((self.type.shape),
+                                           order=order)
+            tensors.append(numpy_tensor)
+        return tensors
+
+    def from_numpy_tensor_list(obj):
+        """
+        Convert a list of numpy tensors (ndarrays) to a tensor extension array.
+        """
+        numpy_type = obj[0].flatten().dtype
+        arrow_type = pa.from_numpy_dtype(numpy_type)
+        shape = obj[0].shape
+        is_row_major = False if np.isfortran(obj[0]) else True
+        size = obj[0].size
+
+        tensor_list = []
+        for tensor in obj:
+            tensor_list.append(tensor.flatten())
+
+        return pa.ExtensionArray.from_storage(
+            TensorType(arrow_type, shape, is_row_major),
+            pa.array(tensor_list, pa.list_(arrow_type, size))
+        )
+
+
+@pytest.fixture
+def registered_tensor_type():
+    # setup
+    tensor_type = TensorType(pa.int8(), (2, 2, 3), True)
+    tensor_class = tensor_type.__arrow_ext_class__()
+    pa.register_extension_type(tensor_type)
+    yield tensor_type, tensor_class
+    # teardown
+    try:
+        pa.unregister_extension_type('arrow.fixed_size_tensor')
+    except KeyError:
+        pass
+
+
+def test_generic_ext_type_tensor():

Review Comment:
   I think you can remove the "generic_ext_type" part of the test names everywhere. That prefix comes from the tests above, which contrast Python-specific extension types with generic (non-Python-specific) extension types.


