Posted to dev@arrow.apache.org by Alenka Frim <al...@voltrondata.com.INVALID> on 2023/05/04 09:52:29 UTC

Re: Arrow community meeting April 26 at 16:00 UTC

Hi all,

I just wanted to chime in on the tensor discussion from last week's
Arrow community meeting call.

> Questions about usage of the new fixed-shape tensor canonical extension
> type [6]
> - Can it be written to a Parquet file and read back in? If so, what
> Parquet logical and physical types does it use?
>

With Arrow (PyArrow in the example below) it can be written to a Parquet
file and read back in. The logical type used in Parquet seems to be a List,
with int32 as the physical type of its items:

import pyarrow as pa

tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))
arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
storage = pa.array(arr, pa.list_(pa.int32(), 4))
tensor_array = pa.ExtensionArray.from_storage(tensor_type, storage)

data = [
    pa.array([1, 2, 3]),
    pa.array(['foo', 'bar', None]),
    pa.array([True, None, True]),
    tensor_array,
]

my_schema = pa.schema([('f0', pa.int8()),
                       ('f1', pa.string()),
                       ('f2', pa.bool_()),
                       ('tensors_int', tensor_type)])

table = pa.Table.from_arrays(data, schema=my_schema)

import pyarrow.parquet as pq
pq.write_table(table, 'example_tensor.parquet')

pq.read_table('example_tensor.parquet')
# pyarrow.Table
# f0: int8
# f1: string
# f2: bool
# tensors_int: extension<arrow.fixed_shape_tensor>
# ----
# f0: [[1,2,3]]
# f1: [["foo","bar",null]]
# f2: [[true,null,true]]
# tensors_int: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]

pq.read_metadata('example_tensor.parquet')
# <pyarrow._parquet.FileMetaData object at 0x125b20270>
# created_by: parquet-cpp-arrow version 12.0.0-SNAPSHOT
# num_columns: 4
# num_rows: 3
# num_row_groups: 1
# format_version: 2.6
# serialized_size: 1164
pq.read_metadata('example_tensor.parquet').schema
# <pyarrow._parquet.ParquetSchema object at 0x125ae2280>
# required group field_id=-1 schema {
#   optional int32 field_id=-1 f0 (Int(bitWidth=8, isSigned=true));
#   optional binary field_id=-1 f1 (String);
#   optional boolean field_id=-1 f2;
#   optional group field_id=-1 tensors_int (List) {
#     repeated group field_id=-1 list {
#       optional int32 field_id=-1 item;
#     }
#   }
# }



> - Is it recommended for use with image data, or should we use byte
> arrays instead?


That depends on the use case. With a fixed-shape tensor you can access
individual image data (pixels) directly. Byte arrays will probably perform
better when reading from and writing to a Parquet file, since they avoid
the repeated List structure shown above (I have not tested this, though),
but they will also need some custom logic to get at individual image data
if needed.

Hope this information helps.

Best,
Alenka

On Thu, Apr 27, 2023 at 10:47 PM Ian Cook <ia...@ursacomputing.com> wrote:

> Below is a summary of the notes from yesterday's meeting:
>
> Attendees:
>
> - Ian Cook
> - Raúl Cumplido
> - Xuwei Fu
> - Will Jones
> - Bryce Mecum
> - Rok Mihevc
> - Sri Nadukudy
> - Matthew Topol
>
>
> Discussion:
>
> Arrow 12.0.0 release
> - RC0 has been proposed [1]
> - There were a lot of CI failures at the time of the code freeze so it
> took longer than usual to resolve these and generate RC0; thanks to
> everyone who helped
> - There is one outstanding question regarding an issue with pandas
> 2.0.1 [2] and there is a fix that skips the failing test [3]
> - It is unclear whether we should create a new RC that skips this
> test, or whether it is sufficient to release the current RC since
> pandas will fix the issue on their end
> - There are a couple of other minor issues that we don’t think are blockers
>
>
> Support for non-CPU memory in Arrow C data interface [4][5]
> - We are seeking input that addresses the questions posed and gives
> concrete recommendations
>
>
> Questions about usage of the new fixed-shape tensor canonical extension
> type [6]
> - Can it be written to a Parquet file and read back in? If so, what
> Parquet logical and physical types does it use?
> - Is it recommended for use with image data, or should we use byte
> arrays instead?
>
>
> Status of proposed integration tests for C data interface [7]
> - Has not yet been implemented
>
>
> Suggested topics for next meeting
> - Discuss priorities for Arrow 13.0.0 release
>
>
> [1] https://lists.apache.org/thread/2cnl1nbr8kfcxxq9s9br9b6f4xpmsqz1
> [2] https://github.com/pandas-dev/pandas/issues/52899
> [3] https://github.com/apache/arrow/pull/35324
> [4] https://github.com/apache/arrow/pull/34972
> [5] https://lists.apache.org/thread/sntc3pp6msdvb94zhq2lvy70s1p6d1qg
> [6] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list
> [7] https://lists.apache.org/thread/nr05xwls713xpsxkobpln2f6wsdntrky
>
>
> On Tue, Apr 25, 2023 at 3:54 PM Ian Cook <ia...@ursacomputing.com> wrote:
> >
> > Hi all,
> >
> > Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.
> >
> > Zoom meeting URL:
> > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> > Meeting ID: 876 4903 3008
> > Passcode: 958092
> >
> > The notes for this and future instances of this meeting will be
> > captured in this Google Doc:
> >
> > https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
> > If you plan to attend this meeting, you are welcome to edit the
> > document to add the topics that you would like to discuss.
> >
> > Thanks,
> > Ian
>