You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by al...@apache.org on 2023/04/12 08:37:29 UTC

[arrow] branch main updated: GH-34956: [Docs][Python] Add to docs the usage of the FixedShapeTensorType (#34957)

This is an automated email from the ASF dual-hosted git repository.

alenka pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new b8427d391f GH-34956: [Docs][Python] Add to docs the usage of the FixedShapeTensorType (#34957)
b8427d391f is described below

commit b8427d391f77454f7009cf7b8091037fd77f01c6
Author: Alenka Frim <Al...@users.noreply.github.com>
AuthorDate: Wed Apr 12 10:37:14 2023 +0200

    GH-34956: [Docs][Python] Add to docs the usage of the FixedShapeTensorType (#34957)
    
    ### Rationale for this change
    
    This PR adds examples of the use of `FixedShapeTensorType`to the PyArrow user guide. Should be reviewed and merged after https://github.com/apache/arrow/pull/34883 is done.
    * Closes: #34956
    
    Lead-authored-by: Alenka Frim <fr...@gmail.com>
    Co-authored-by: Alenka Frim <Al...@users.noreply.github.com>
    Co-authored-by: Rok Mihevc <ro...@mihevc.org>
    Signed-off-by: Alenka Frim <fr...@gmail.com>
---
 docs/source/python/extending_types.rst | 160 +++++++++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)

diff --git a/docs/source/python/extending_types.rst b/docs/source/python/extending_types.rst
index 9b6743cb10..53ce70e13b 100644
--- a/docs/source/python/extending_types.rst
+++ b/docs/source/python/extending_types.rst
@@ -357,3 +357,163 @@ pandas ``ExtensionArray``. This method should have the following signature::
 
 This way, you can control the conversion of a pyarrow ``Array`` of your pyarrow
 extension type to a pandas ``ExtensionArray`` that can be stored in a DataFrame.
+
+
+Canonical extension types
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can find the official list of canonical extension types in the
+:ref:`format_canonical_extensions` section. Here we add examples on how to
+use them in pyarrow.
+
+Fixed size tensor
+"""""""""""""""""
+
+To create an array of tensors with equal shape (fixed shape tensor array) we
+first need to define a fixed shape tensor extension type with value type
+and shape:
+
+.. code-block:: python
+
+   >>> tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))
+
+Then we need the storage array with :func:`pyarrow.list_` type where ``value_type```
+is the fixed shape tensor value type and list size is a product of ``tensor_type``
+shape elements. Then we can create an array of tensors with
+``pa.ExtensionArray.from_storage()`` method:
+
+.. code-block:: python
+
+   >>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
+   >>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
+   >>> tensor_array = pa.ExtensionArray.from_storage(tensor_type, storage)
+
+We can also create another array of tensors with different value type:
+
+.. code-block:: python
+
+   >>> tensor_type_2 = pa.fixed_shape_tensor(pa.float32(), (2, 2))
+   >>> storage_2 = pa.array(arr, pa.list_(pa.float32(), 4))
+   >>> tensor_array_2 = pa.ExtensionArray.from_storage(tensor_type_2, storage_2)
+
+Extension arrays can be used as columns in  ``pyarrow.Table`` or
+``pyarrow.RecordBatch``:
+
+.. code-block:: python
+
+   >>> data = [
+   ...     pa.array([1, 2, 3]),
+   ...     pa.array(['foo', 'bar', None]),
+   ...     pa.array([True, None, True]),
+   ...     tensor_array,
+   ...     tensor_array_2
+   ... ]
+   >>> my_schema = pa.schema([('f0', pa.int8()),
+   ...                        ('f1', pa.string()),
+   ...                        ('f2', pa.bool_()),
+   ...                        ('tensors_int', tensor_type),
+   ...                        ('tensors_float', tensor_type_2)])
+   >>> table = pa.Table.from_arrays(data, schema=my_schema)
+   >>> table
+   pyarrow.Table
+   f0: int8
+   f1: string
+   f2: bool
+   tensors_int: extension<arrow.fixed_size_tensor>
+   tensors_float: extension<arrow.fixed_size_tensor>
+   ----
+   f0: [[1,2,3]]
+   f1: [["foo","bar",null]]
+   f2: [[true,null,true]]
+   tensors_int: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]
+   tensors_float: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]
+
+We can also convert a tensor array to a single multi-dimensional numpy ndarray.
+With the conversion the length of the arrow array becomes the first dimension
+in the numpy ndarray:
+
+.. code-block:: python
+
+   >>> numpy_tensor = tensor_array_2.to_numpy_ndarray()
+   >>> numpy_tensor
+   array([[[  1.,   2.],
+           [  3.,   4.]],
+          [[ 10.,  20.],
+           [ 30.,  40.]],
+          [[100., 200.],
+           [300., 400.]]])
+    >>> numpy_tensor.shape
+   (3, 2, 2)
+
+.. note::
+
+   Both optional parameters, ``permutation`` and ``dim_names``, are meant to provide the user
+   with the information about the logical layout of the data compared to the physical layout.
+
+   The conversion to numpy ndarray is only possible for trivial permutations (``None`` or
+   ``[0, 1, ... N-1]`` where ``N`` is the number of tensor dimensions).
+
+And also the other way around, we can convert a numpy ndarray to a fixed shape tensor array:
+
+.. code-block:: python
+
+   >>> pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
+   <pyarrow.lib.FixedShapeTensorArray object at ...>
+   [
+     [
+       1,
+       2,
+       3,
+       4
+     ],
+     [
+       10,
+       20,
+       30,
+       40
+     ],
+     [
+       100,
+       200,
+       300,
+       400
+     ]
+   ]
+
+With the conversion the first dimension of the ndarray becomes the length of the pyarrow extension
+array. We can see in the example that ndarray of shape ``(3, 2, 2)`` becomes an arrow array of
+length 3 with tensor elements of shape ``(2, 2)``.
+
+.. code-block:: python
+
+   # ndarray of shape (3, 2, 2)
+   >>> numpy_tensor.shape
+   (3, 2, 2)
+
+   # arrow array of length 3 with tensor elements of shape (2, 2)
+   >>> pyarrow_tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
+   >>> len(pyarrow_tensor_array)
+   3
+   >>> pyarrow_tensor_array.type.shape
+   [2, 2]
+
+The extension type can also have ``permutation`` and ``dim_names`` defined. For
+example
+
+.. code-block:: python
+
+    >>> tensor_type = pa.fixed_shape_tensor(pa.float64(), [2, 2, 3], permutation=[0, 2, 1])
+
+or
+
+.. code-block:: python
+
+    >>> tensor_type = pa.fixed_shape_tensor(pa.bool_(), [2, 2, 3], dim_names=['C', 'H', 'W'])
+
+for ``NCHW`` format where:
+
+* N: number of images which is in our case the length of an array and is always on
+  the first dimension
+* C: number of channels of the image
+* H: height of the image
+* W: width of the image