Posted to dev@arrow.apache.org by Leif Walsh <le...@gmail.com> on 2018/04/09 21:22:48 UTC

Tensor column types in arrow

Hi all,

I’ve been doing some work lately with Spark’s ML interfaces, which include
sparse and dense Vector and Matrix types, backed on the Scala side by
Breeze. Using these interfaces, you can construct DataFrames whose column
types are vectors and matrices, and though the API isn’t terribly rich, it
is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
out of each row. However, if you’re using Spark’s Arrow serialization
between the executor and Python workers, you get this
UnsupportedOperationException:
https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71

I think it would be useful for Arrow to support something like a column of
tensors, and I’d like to see if anyone else here is interested in such a
thing.  If so, I’d like to propose adding it to the spec and getting it
implemented in at least Java and C++/Python.

Some initial mildly-scattered thoughts:

1. You can certainly represent these today as List<Double> and
List<List<Double>>, but then you need to do some copying to get them back
into numpy ndarrays.

2. In some cases it might be useful to know that a column contains 3x3x4
tensors, for example, and not just that there are three dimensions as you’d
get with List<List<List<Double>>>.  This could constrain what operations
are meaningful (for example, in Spark you could imagine type checking that
verifies dimension alignment for matrix multiplication).

3. You could approximate that with a FixedSizeList and metadata about the
tensor shape (see the sketch after this list).

4. But I kind of feel like this is generally useful enough that it’s worth
having one implementation of it (well, one for each runtime) in Arrow.

5. Or, maybe everyone here thinks Spark should just do this with metadata?

Curious to hear what you all think.

Thanks,
Leif

-- 
Cheers,
Leif

Re: Tensor column types in arrow

Posted by Leif Walsh <le...@gmail.com>.
Thanks, I’ll create a JIRA and Google doc. I agree those are the main
questions to iron out.

If there’s a desire to avoid scope-creeping this in before 1.0, I think in
parallel I’ll start a conversation with the Spark community about using the
existing FixedSizeBinary type plus some custom metadata to provide
serialization for their ML UDTs, and let them know that if this is added to
Arrow in the future, they could switch that implementation to use the new
Arrow types instead.
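
Roughly what I have in mind (a sketch; the metadata keys here are invented,
not an existing Spark convention):

    import numpy as np
    import pyarrow as pa

    shape = (3, 3)
    dtype = np.dtype("float64")
    width = int(np.prod(shape)) * dtype.itemsize   # 72 bytes per cell

    # One FixedSizeBinary cell per matrix, with the logical type carried
    # in custom field metadata (these keys are made up).
    mat_type = pa.binary(width)
    field = pa.field("features", mat_type,
                     metadata={"udt": "matrix",
                               "shape": "3,3",
                               "dtype": "float64"})

    matrices = [np.eye(3), np.zeros(shape)]
    arr = pa.array([m.tobytes() for m in matrices], type=mat_type)

    # Reading back is a reinterpretation of the packed data buffer
    # (safe here because the array has no nulls).
    data = arr.buffers()[1]
    restored = np.frombuffer(data, dtype=dtype).reshape(-1, *shape)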


-- 
Cheers,
Leif

Re: Tensor column types in arrow

Posted by Wes McKinney <we...@gmail.com>.
The simplest thing would be to have a "tensor" or "ndarray" type where
each cell has the same shape. This would amount to adding the current
"Tensor" Flatbuffers table to the Type union in

https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

The benefit of having each cell have the same shape is that the
physical representation is FixedSizeBinary.

Some caveats / notes:

* We have a prior unresolved discussion about our approach to logical
types. I could argue that this might fall into the same bucket of
logical types. I don't think we should merge any patches related to
this issue until we resolve that discussion

* Using FixedSizeBinary as the physical representation constrains
cell sizes to 2GB (the product of the shape and the element width in
bytes), because the FixedSizeBinary metadata uses a 32-bit int for the
byteWidth; see the sketch after these notes. We might consider changing
this to long (64 bits), but that's a separate discussion

* If we permitted each cell to have a different shape, then we would
need to use Binary (vs. FixedSizeBinary), which would limit the entire
size of a column to 2GB of total tensor data. This could be mitigated
by introducing LargeBinary (64 bit offsets), but this requires
additional discussion (there is a JIRA about this already from some
time ago)
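
To make the byteWidth arithmetic in the second note concrete (a
hypothetical helper, not an Arrow API):

    import numpy as np

    INT32_MAX = 2**31 - 1

    def fixed_size_binary_width(shape, itemsize=8):
        # byteWidth for one cell: product of the shape times the
        # element size in bytes.
        width = int(np.prod(shape)) * itemsize
        if width > INT32_MAX:
            # FixedSizeBinary stores byteWidth as a 32-bit int, hence
            # the ~2GB per-cell ceiling noted above.
            raise OverflowError("cell would need %d bytes" % width)
        return width

    print(fixed_size_binary_width((3, 3, 4)))    # 288 bytes per cell
    fixed_size_binary_width((16384, 16384, 16))  # raises OverflowError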

Given that we are still falling short of a complete implementation of
other Arrow types (unions, intervals, fixed size lists), I urge all to
be deliberate about not piling on more technical debt / format
implementation shortfall if it can be avoided -- so a solution to this
might be to have a patch for initial Tensor/Ndarray value support that
is implemented in Java and/or C++

How about creating a JIRA about this broad topic and creating a Google
doc with a proposed implementation approach for discussion?

Thanks
Wes


Re: Tensor column types in arrow

Posted by Li Jin <ic...@gmail.com>.
What do people think about whether "shape" should be included as an
optional part of schema metadata or a required part of the schema itself?
(Both options are sketched below.)

I feel having it be required might be too restrictive for interop with
other systems.
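
To illustrate the two placements (a sketch; the fixed-shape tensor type
doesn't exist yet, so a FixedSizeList stands in for it):

    import pyarrow as pa

    # (a) Optional: shape lives in field metadata; a system that doesn't
    # understand the key still reads the column as a plain list.
    f_optional = pa.field("t", pa.list_(pa.float64()),
                          metadata={"shape": "3,3,4"})

    # (b) Required: shape is part of the type, so every reader must honor
    # it. Approximated here by a FixedSizeList of the flattened length,
    # since no first-class fixed-shape tensor type exists yet.
    f_required = pa.field("t", pa.list_(pa.float64(), 36))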


Re: Tensor column types in arrow

Posted by Leif Walsh <le...@gmail.com>.
My gut feeling is that such a column type should specify both the shape and
primitive type of all values in the column. I can’t think of a common use
case that requires differently shaped tensors in a single column.

Can anyone here come up with such a use case?

If not, I can try to draft a proposed change to the spec that adds these
types. The next question is whether such a change can make it in (with C++
and Java implementations) before 1.0.

-- 
Cheers,
Leif

Re: Tensor column types in arrow

Posted by Wes McKinney <we...@gmail.com>.
> As far as I know, there is an implementation of a tensor type in C++/Python already. Should we just finalize the spec and add an implementation to Java?

There is nothing specified yet as far as a *column* of
ndarrays/tensors. We defined Tensor metadata for the purposes of
IPC/serialization but made no effort to incorporate such data into the
columnar format.

There are likely many ways to implement a column whose values are
ndarrays, each cell with its own shape. Whether we would want to
permit each cell to have a different ndarray cell type is another
question (i.e., would we want to constrain every cell in a column to
contain ndarrays of a particular type, like float64).

So there are a couple of questions:

* How to represent the data using the columnar format
* How to incorporate ndarray metadata into columnar schemas

- Wes


Re: Tensor column types in arrow

Posted by Leif Walsh <le...@gmail.com>.
The tensor type in the C++ API is a stand-alone object AFAICT; Phillip and
I were unable to construct an Arrow column of them. I agree that it’s a
good starting point. One interpretation of what I’m suggesting is that we
take it as the reference implementation, add it to the spec, and write the
Java implementation.
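
Concretely, roughly what we ran into (a pyarrow sketch; the failing call is
illustrative):

    import numpy as np
    import pyarrow as pa

    # What exists today: a standalone Tensor, defined for IPC.
    t = pa.Tensor.from_numpy(np.arange(12.0).reshape(3, 4))
    print(t.shape)    # (3, 4)

    # What doesn't: Tensor is not an Arrow value type, so there is no
    # supported path like pa.array([t, t]) to build a column of tensors;
    # that's the gap this thread is about.
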
-- 
Cheers,
Leif

Re: Tensor column types in arrow

Posted by Li Jin <ic...@gmail.com>.
As far as I know, there is an implementation of a tensor type in C++/Python
already. Should we just finalize the spec and add an implementation to Java?

On the Spark side, it's probably more complicated, as Vector and Matrix are
not "first class" types in Spark SQL. Spark ML implements them as UDTs
(user-defined types), so it's not clear how to make the Spark/Arrow
converter work with them.

I wonder if Bryan and Holden have some more thoughts on that?

Li
