Posted to user@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/07/01 19:37:34 UTC

Re: RecordBatch with Tensors/Arrays

hi Andrew,

I'm copying dev@ so that more folks are in the loop.

On Wed, Jun 19, 2019 at 9:13 AM Andrew Spott <an...@gmail.com> wrote:
>
> I was told to post this here, rather than as an issue on Github.
>
> ====
>
> I'm looking to serialize data that looks something like this:
>
> ```
> record<n1> = { "predicted": <tensor with shape n1, m>,
>                           "truth": <tensor with shape n1, m>,
>                           "loss": <double>,
>                           "index": <array with shape n1>}
>
> data = [
>     pa.array([record<n1>, record<n2>, record<n3>]),
>     pa.array([<float>, <float>, <float>]),
>     pa.array([<float>, <float>, <float>])
> ]
>
> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
> ```
>
> But I'm not sure how to do that, or even if what I'm trying to do is the right way to do it.

We don't support tensors/ndarrays as first-class value types in the
Python or C++ libraries. This could be done hypothetically using the
new ExtensionType facility. Tensor values would be embedded in an
Arrow Binary column.

There is already ARROW-1614 open for this. I also opened ARROW-5819
about implementing the Python-side plumbing around this.

Another possible option is to infer list<...> types from ndarrays
(e.g. list<list<double>> from an ndarray of ndim=2 and dtype=float64),
but this has not been implemented.

>
> What is the difference between `pa.array` and `pa.list_`?  This formulation is an array of structs, but is the struct of arrays formulation of this possible? i.e.:
>

* The return value of pa.array is an Array object, which wraps the C++
arrow::Array type, the base class for value sequences. It's data, not
metadata.
* pa.list_ returns an instance of ListType, which is a DataType
subclass. It's metadata, not data.

> ```
> data = [
>     pa.array([ <tensor with shape n1, m>,  <tensor with shape n2, m>,  <tensor with shape n3, m>]),
>     pa.array([ <tensor with shape n1, m>,  <tensor with shape n2, m>,  <tensor with shape n3, m>]),
>     pa.array([<float>, <float>, <float>]),
> ...
> ]
> ```
>
> Which doesn't currently work.  It seems that there is a separation between '1d arraylike' datatypes and 'pythonlike' datatypes (and 'nd arraylike' datatypes), so I can't have a struct of an array.
>

Right. ndarrays as array cell values are not natively part of the
Arrow columnar format, but they could be supported through extensions.
This would be a nice project for someone to take on in the future.

- Wes

> -Andrew