Posted to dev@arrow.apache.org by Alenka Frim <al...@voltrondata.com.INVALID> on 2023/02/21 12:38:48 UTC

[VOTE] Format: Fixed shape tensor Canonical Extension Type

Hi all,

I would like to propose we vote on adding the fixed shape tensor canonical
extension type with the following specification:

Fixed shape tensor
==================

* Extension name: `arrow.fixed_shape_tensor`.

* The storage type of the extension: ``FixedSizeList`` where:

  * **value_type** is the data type of individual tensors and
    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
  * **list_size** is the product of all the elements in tensor shape.

* Extension type parameters:

  * **value_type** = Arrow DataType of the tensor elements
  * **shape** = shape of the contained tensors as an array

  Optional parameters:

  * **dim_names** = explicit names to tensor dimensions
    as an array. The length of it should be equal to the shape
    length and equal to the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation**  = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutation[i]`` of the physical tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

* Description of the serialization:

  The metadata must be a valid JSON object including shape of
  the contained tensors as an array with key **"shape"** plus optional
  dimension names with keys **"dim_names"** and ordering of the
  dimensions with key **"permutation"**.

  - Example: ``{ "shape": [2, 5]}``
  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``

  - Example of permuted 3-dimensional tensor:

    ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``

.. note::

  Elements in a fixed shape tensor extension array are stored
  in row-major/C-contiguous order.
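
As a rough illustration of the storage and serialization rules above, the following Python sketch derives ``list_size`` from a shape and builds the JSON metadata string. The helper names (``list_size_for``, ``serialize_metadata``) are hypothetical and not part of any Arrow library; only the keys ``shape``, ``dim_names`` and ``permutation`` come from the proposal text.

```python
import json
import math


def list_size_for(shape):
    """Return the FixedSizeList size: the product of all shape entries."""
    return math.prod(shape)


def serialize_metadata(shape, dim_names=None, permutation=None):
    """Build the extension metadata as a JSON object string.

    Hypothetical helper: optional keys are only written when provided.
    """
    ndim = len(shape)
    metadata = {"shape": list(shape)}
    if dim_names is not None:
        if len(dim_names) != ndim:
            raise ValueError("dim_names must have one entry per dimension")
        metadata["dim_names"] = list(dim_names)
    if permutation is not None:
        if sorted(permutation) != list(range(ndim)):
            raise ValueError("permutation must be a permutation of 0..N-1")
        metadata["permutation"] = list(permutation)
    return json.dumps(metadata)


# A [2, 5] tensor needs a FixedSizeList of 10 values per row.
size = list_size_for([2, 5])
meta = serialize_metadata([100, 200, 500], dim_names=["C", "H", "W"])
```

Omitting the optional keys when they are not set mirrors the proposal's statement that a trivial permutation can be left out; an actual implementation may choose differently.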


* The specification is submitted as a PR [1] to the Canonical Extension Types
  document under the format specifications directory [2].

There are also two implementations submitted to the Apache Arrow repository:

* C++ implementation of the proposed specification [3]
* Python example implementation of the proposed specification and usage
  (illustrative only) [4]


The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...


Regards, Alenka

[1]: https://github.com/apache/arrow/pull/33925/files
[2]: https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst

[3]: https://github.com/apache/arrow/pull/8510/files
[4]: https://github.com/apache/arrow/pull/33948/files

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Alenka Frim <al...@voltrondata.com.INVALID>.
No problem Kevin. Thank you for sharing the information with your
colleagues.
All comments are much appreciated.

As there were no additional comments/suggestions to the spec itself, I will
open up another voting thread today.

Thanks all!
Alenka

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Kevin Gurney <kg...@mathworks.com>.
Hi Alenka,

Thank you. I've informed my colleagues at MathWorks to add any further comments to the PR.

My apologies for bringing this up on the voting thread.

Best Regards,

Kevin Gurney

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Alenka Frim <al...@voltrondata.com.INVALID>.
This was actually already meant as the voting thread, but given it sparked
some more discussion, let's give this a few more days, and then re-start
with a new vote thread.

*So if someone still has comments on the current text, please bring those
up here or in the PR*: https://github.com/apache/arrow/pull/33925.

Alenka

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Kevin Gurney <kg...@mathworks.com>.
Hi All,

Thank you very much for creating this proposal, Alenka!

I noticed the following in the notes [1] shared from the February 15th Arrow Community Meeting:

"Members of Hugging Face, Ray, and PyTorch community have given input and some of it was incorporated - It would be good to have input from some other companies and project communities including Lance, NumPy, Posit, MATLAB, DLPack, CUDA/RAPIDS, Arrow Rust, Xarray, Julia, Fortran, TensorFlow, LinkedIn"

Based on the inclusion of MATLAB in the list above, I've shared this proposal with some colleagues at MathWorks who have expertise in the deep learning area. They will respond here if they have any additional input to add.

That being said, I recognize that this proposal is already nearing the voting phase.

[1] https://lists.apache.org/thread/bblcwwq7gl1x2hsr1qsormv9f3vr23jn

Best Regards,

Kevin Gurney

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Rok Mihevc <ro...@gmail.com>.
That makes sense indeed.
Do we have any more comments on the language of the proposal [1] or should
we proceed to vote?

Rok

[1] https://github.com/apache/arrow/pull/33925/files

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Antoine Pitrou <an...@python.org>.
That's a good point.

Regards

Antoine.


Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.
I don't think having both dimension names and permutation is
redundant...dimension names can also serve as human-readable tags that help
a human interpret the values. If reading a NetCDF, for example, one might
store the dimension variable names. When determining type equality it may
be useful that {..., permutation = [2, 0, 1], dim_names = ["C", "H", "W"]}
is not equal to {..., permutation = [2, 0, 1], dim_names = ["x", "y", "z"]}.
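
To make the equality point concrete, here is a minimal Python sketch in which the type parameters are modelled as a frozen dataclass. ``TensorTypeParams`` is a hypothetical stand-in, not an Arrow class, and whether a real implementation includes ``dim_names`` in type equality is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorTypeParams:
    """Hypothetical stand-in for the extension type parameters."""
    shape: Tuple[int, ...]
    dim_names: Optional[Tuple[str, ...]] = None
    permutation: Optional[Tuple[int, ...]] = None


chw = TensorTypeParams((100, 200, 500), ("C", "H", "W"), (2, 0, 1))
xyz = TensorTypeParams((100, 200, 500), ("x", "y", "z"), (2, 0, 1))
same = TensorTypeParams((100, 200, 500), ("C", "H", "W"), (2, 0, 1))

# Same shape and permutation, different names: the parameter sets differ.
names_matter = chw != xyz
# Field-by-field equality still holds when every parameter matches.
identical = chw == same
```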

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Rok Mihevc <ro...@gmail.com>.
> > > Should we rule that `dim_names` and `permutation` are mutually
> > > exclusive?
> >
> > Since `dim_names` have to "map to the physical layout (row-major)" that
> > means permutation will always be trivial which indeed makes it
> > unnecessary to store both.
>
> I don't think it is necessarily needed to explicitly make them
> mutually exclusive. I don't know how useful this would be in practice,
> but you certainly *can* specify both in a meaningful way. Re-using the
> example of NHWC data, which is physically stored as NCHW, you can keep
> track of this by specifying a permutation of [2, 0, 1], but at the
> same time you could also still save the dimension names as ["C", "H",
> "W"].

I'll advocate for the original comment, but I'm ok either way. Having both
`dim_names` and `permutation` is redundant - if the user knows their
desired order of `dim_names` they can derive the permutation. If they don't
use `dim_names` they probably don't want them.

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 21 Feb 2023 at 18:00, Rok Mihevc <ro...@gmail.com> wrote:
>
> >
> > Should we rule that `dim_names` and `permutation` are mutually exclusive?
> >
>
> Since `dim_names` have to "map to the physical layout (row-major)" that
> means permutation will always be trivial which indeed makes it unnecessary
> to store both.

I don't think it is necessarily needed to explicitly make them
mutually exclusive. I don't know how useful this would be in practice,
but you certainly *can* specify both in a meaningful way. Re-using the
example of NHWC data, which is physically stored as NCHW, you can keep
track of this by specifying a permutation of [2, 0, 1], but at the
same time you could also still save the dimension names as ["C", "H",
"W"].

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Rok Mihevc <ro...@gmail.com>.
>
> Should we rule that `dim_names` and `permutation` are mutually exclusive?
>

Since `dim_names` have to "map to the physical layout (row-major)" that
means permutation will always be trivial which indeed makes it unnecessary
to store both.
(This makes me think about extension type implementations: do we want to
offer an API to write/read an arbitrary order and handle that logic, or do
we leave that complexity to the user?)
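
For a sense of what handling that logic could involve, here is a small Python sketch that maps an index in the logical view to an offset in the flat row-major storage. The function names (``row_major_strides``, ``flat_offset``) are hypothetical and do not correspond to any existing Arrow API; the sketch only applies the rule from the proposal that logical dimension ``i`` corresponds to physical dimension ``permutation[i]``.

```python
def row_major_strides(shape):
    """Element strides of a row-major (C-contiguous) tensor."""
    strides = [1] * len(shape)
    for k in range(len(shape) - 2, -1, -1):
        strides[k] = strides[k + 1] * shape[k + 1]
    return strides


def flat_offset(shape, permutation, logical_index):
    """Map an index in the logical view to an offset in the flat storage.

    Logical dimension i corresponds to physical dimension permutation[i].
    """
    physical_index = [0] * len(shape)
    for i, p in enumerate(permutation):
        physical_index[p] = logical_index[i]
    strides = row_major_strides(shape)
    return sum(idx * s for idx, s in zip(physical_index, strides))


# Physical shape [2, 3] stored row-major; the logical view is its transpose.
values = [0, 1, 2, 3, 4, 5]
offset = flat_offset([2, 3], [1, 0], (2, 1))
element = values[offset]
```

Whether such index translation belongs in the extension type implementation or is left to the caller is exactly the open question raised above; the sketch takes no position on it.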

Rok

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Antoine Pitrou <an...@python.org>.
Hi Alenka,

Le 21/02/2023 à 13:38, Alenka Frim a écrit :
> 
> Fixed shape tensor
> ==================
> 
> * Extension name: `arrow.fixed_shape_tensor`.
> 
> * The storage type of the extension: ``FixedSizeList`` where:
> 
>    * **value_type** is the data type of individual tensors and
>      is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.

I would say "the data type of individual tensor elements".
(so that people don't try to make it e.g. List(float64)).

Also, I don't think any reference to pyarrow should be made here.

>    * **list_size** is the product of all the elements in tensor shape.
> 
> * Extension type parameters:
> 
>    * **value_type** = Arrow DataType of the tensor elements
>    * **shape** = shape of the contained tensors as an array

I would say "the physical shape" to make it clear it refers to how
values are laid out in memory, while `dim_names` and `permutation` drive
the logical interpretation.

>    Optional parameters:
> 
>    * **dim_names** = explicit names to tensor dimensions
>      as an array. The length of it should be equal to the shape
>      length and equal to the number of dimensions.
> 
>      ``dim_names`` can be used if the dimensions have well-known
>      names and they map to the physical layout (row-major).
> 
>    * **permutation**  = indices of the desired ordering of the
>      original dimensions, defined as an array.
> 
>      The indices contain a permutation of the values [0, 1, .., N-1] where
>      N is the number of dimensions. The permutation indicates which
>      dimension of the logical layout corresponds to which dimension of the
>      physical tensor (the i-th dimension of the logical view corresponds
>      to the dimension with number ``permutation[i]`` of the physical tensor).
> 
>      Permutation can be useful in case the logical order of
>      the tensor is a permutation of the physical order (row-major).
> 
>      When logical and physical layout are equal, the permutation will always
>      be ([0, 1, .., N-1]) and can therefore be left out.

Should we rule that `dim_names` and `permutation` are mutually exclusive?

> * Description of the serialization:
> 
>    The metadata must be a valid JSON object including shape of
>    the contained tensors as an array with key **"shape"** plus optional
>    dimension names with keys **"dim_names"** and ordering of the
>    dimensions with key **"permutation"**.
> 
>    - Example: ``{ "shape": [2, 5]}``
>    - Example with ``dim_names`` metadata for NCHW ordered data:
> 
>      ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``
> 
>    - Example of permuted 3-dimensional tensor:
> 
>      ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``

Perhaps explain in this example that the logical shape is [500, 100, 200]?
(if I understand `permutation` correctly)

Regards

Antoine.

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Alenka Frim <al...@voltrondata.com.INVALID>.
> I would say "the data type of individual tensor elements".
> (so that people don't try to make it e.g. List(float64)).
>
> Also, I don't think any reference to pyarrow should be made here.


Good catch! I have updated the text with:

  * **value_type** is the data type of individual tensor elements
    and is an instance of Arrow ``DataType`` or ``Field``.

> I would say "the physical shape" to make it clear it refers to how
> values are laid out in memory, while `dim_names` and `permutation` drive
> the logical interpretation.

I have updated the description of the shape and added logical layout to the
optional parameters text:

* Extension type parameters:

  * **value_type** = Arrow DataType or Field of the tensor elements.
  * **shape** = the physical shape of the contained tensors
    as an array.

  Optional parameters describing the logical layout:

> Perhaps explain in this example that the logical shape is [500, 100, 200]?
> (if I understand `permutation` correctly)


Updated the text with:

  - Example of permuted 3-dimensional tensor:


    ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``


    This is the physical layout shape; the shape of the logical
    layout would in this case be ``[500, 100, 200]``.
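
To make the mapping concrete, here is a minimal Python sketch of how a
reader might derive the logical shape from the serialized metadata. The
helper name ``logical_shape`` is hypothetical (it is not part of the
specification or of any Arrow API); it only assumes the rule stated above,
that the i-th logical dimension is physical dimension ``permutation[i]``.

```python
import json
from math import prod


def logical_shape(metadata_json):
    """Return the logical shape implied by fixed shape tensor metadata.

    Hypothetical helper for illustration only, not an Arrow API.
    """
    metadata = json.loads(metadata_json)
    shape = metadata["shape"]  # physical (row-major) shape
    # A missing permutation means logical and physical layouts are equal.
    permutation = metadata.get("permutation", list(range(len(shape))))
    if sorted(permutation) != list(range(len(shape))):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    # The i-th logical dimension is physical dimension permutation[i].
    return [shape[p] for p in permutation]


meta = '{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}'
print(logical_shape(meta))              # [500, 100, 200]
# list_size of the FixedSizeList storage is the product of the shape.
print(prod(json.loads(meta)["shape"]))  # 10000000
```

The permutation only changes how the dimensions are interpreted; the
``list_size`` of the storage stays the product of the physical shape.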


> +1! I put together a quick R implementation as well to see how the
> permutation field fits with our native column-major storage [1]. It worked
> great! Thank you for all of your work assembling all of our collective
> opinions on this :-)
>

That is great to hear! Thank you so much for your input Dewey, it helped me
understand the R side of things much better.

The updated version of the specification can be found here:
https://github.com/apache/arrow/pull/33925/files

All well,
Alenka

Re: [VOTE] Format: Fixed shape tensor Canonical Extension Type

Posted by Dewey Dunnington <de...@voltrondata.com.INVALID>.
+1! I put together a quick R implementation as well to see how the
permutation field fits with our native column-major storage [1]. It worked
great! Thank you for all of your work assembling all of our collective
opinions on this :-)

[1] https://gist.github.com/paleolimbot/c42f068c2b8b98255dbfbe379d905607

On Tue, Feb 21, 2023 at 8:39 AM Alenka Frim <al...@voltrondata.com.invalid>
wrote:

> Hi all,
>
> I would like to propose we vote on adding the fixed shape tensor canonical
> extension type
> with the following specification:
>
> Fixed shape tensor
> ==================
>
> * Extension name: `arrow.fixed_shape_tensor`.
>
> * The storage type of the extension: ``FixedSizeList`` where:
>
>   * **value_type** is the data type of individual tensors and
>     is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
>   * **list_size** is the product of all the elements in tensor shape.
>
> * Extension type parameters:
>
>   * **value_type** = Arrow DataType of the tensor elements
>   * **shape** = shape of the contained tensors as an array
>
>   Optional parameters:
>
>   * **dim_names** = explicit names to tensor dimensions
>     as an array. The length of it should be equal to the shape
>     length and equal to the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical tensor).
>
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will always
>     be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including shape of
>   the contained tensors as an array with key **"shape"** plus optional
>   dimension names with keys **"dim_names"** and ordering of the
>   dimensions with key **"permutation"**.
>
>   - Example: ``{ "shape": [2, 5]}``
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
>     ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``
>
>   - Example of permuted 3-dimensional tensor:
>
>     ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``
>
> .. note::
>
>   Elements in a fixed shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> * The specification is submitted as a PR [1] to Canonical Extension Types
>   document under the format specifications directory [2].
>
> There are also two implementations submitted to Apache Arrow repository:
> * C++ implementation of the proposed specification [3]
> * Python example implementation of the proposed specification and usage
>   (only illustrative) [4]
>
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Accept this proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
>
> Regards, Alenka
>
> [1]: https://github.com/apache/arrow/pull/33925/files
> [2]: https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
> [3]: https://github.com/apache/arrow/pull/8510/files
> [4]: https://github.com/apache/arrow/pull/33948/files
>