Posted to dev@spark.apache.org by Leif Walsh <le...@gmail.com> on 2018/04/12 00:53:12 UTC

Possible SPIP to improve matrix and vector column type support

Hi all,

I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
I’ve found myself wanting more.

There is a minor issue: with Arrow serialization enabled, these types don’t
serialize properly in Python UDF calls or in toPandas. There’s a natural
representation for them in numpy.ndarray, and I’ve started a conversation
with the Arrow community about supporting tensor-valued columns, but that
might be a ways out. In the meantime, I think we can fix this by using the
FixedSizeBinary column type in Arrow, together with some metadata describing
the tensor shape (a list of dimension sizes).
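
Roughly what I have in mind, sketched directly against pyarrow (the metadata
keys and the 2x2 float64 example are just illustrative, not a proposed
format):

    import numpy as np
    import pyarrow as pa

    # Each 2x2 float64 matrix is 4 * 8 = 32 bytes, so a FixedSizeBinary(32)
    # column can carry the raw buffers; the shape rides along as field metadata.
    shape = (2, 2)
    width = int(np.prod(shape)) * np.dtype(np.float64).itemsize

    matrices = [np.arange(4, dtype=np.float64).reshape(shape) + i for i in range(3)]

    field = pa.field(
        "feature",
        pa.binary(width),  # FixedSizeBinary of exactly `width` bytes
        metadata={"tensor_shape": "2,2", "tensor_dtype": "float64"},
    )
    column = pa.array([m.tobytes() for m in matrices], type=pa.binary(width))
    table = pa.Table.from_arrays([column], schema=pa.schema([field]))

    # On the receiving side (a UDF, or toPandas), the metadata is enough to
    # rebuild the ndarrays from the fixed-size buffers.
    meta = table.schema.field("feature").metadata
    out_shape = tuple(int(d) for d in meta[b"tensor_shape"].decode().split(","))
    out_dtype = meta[b"tensor_dtype"].decode()
    recovered = [
        np.frombuffer(v.as_py(), dtype=out_dtype).reshape(out_shape)
        for v in table.column("feature")
    ]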

The larger issue, for which I intend to submit an SPIP soon, is that these
types could be better supported at the API layer, regardless of
serialization. In the limit, we could consider the entire numpy ndarray
surface area as a target. At a minimum, what I’m thinking is that these
types should support column operations like matrix multiply, transpose, and
inner and outer product, and maybe have a more ergonomic construction API
like df.withColumn('feature', Vectors.of('list', 'of', 'cols')); the
VectorAssembler API is kind of clunky.
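
For comparison, here's the VectorAssembler route today next to the kind of
one-liner I'd like (Vectors.of is hypothetical and doesn't exist today):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["x", "y", "z"])

    # Today: build a transformer, name the output column, run a transform.
    assembler = VectorAssembler(inputCols=["x", "y", "z"], outputCol="feature")
    assembled = assembler.transform(df)

    # What I'd like (hypothetical, does not exist):
    # assembled = df.withColumn("feature", Vectors.of("x", "y", "z"))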

One possibility here is to restrict the tensor column types so that every
value must have the same shape, e.g. a 2x2 matrix. This would allow
operations to check validity before execution; for example, a matrix
multiply could check that the dimensions match and fail fast. However, there
might be use cases for a column containing variable-shape tensors, so I’m
open to discussion here.
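
To make the fail-fast idea concrete, here's roughly the shape check a
fixed-shape matrix multiply could run at analysis time, before touching any
data (TensorType here is made up for the example):

    # Hypothetical fixed-shape tensor type; the dimension check is the point.
    class TensorType:
        def __init__(self, shape):
            self.shape = tuple(shape)

    def check_matmul(left, right):
        """Validate a matrix multiply at plan time and return the result type."""
        if len(left.shape) != 2 or len(right.shape) != 2:
            raise TypeError("matmul requires rank-2 tensors")
        if left.shape[1] != right.shape[0]:
            raise TypeError("dimension mismatch: %s x %s" % (left.shape, right.shape))
        return TensorType((left.shape[0], right.shape[1]))

    check_matmul(TensorType((2, 3)), TensorType((3, 4)))   # ok, result is 2x4
    # check_matmul(TensorType((2, 3)), TensorType((2, 3)))  # fails fast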

What do you all think?
-- 
-- 
Cheers,
Leif

Re: Possible SPIP to improve matrix and vector column type support

Posted by Leif Walsh <le...@gmail.com>.
I filed an SPIP for this at
https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss!

-- 
-- 
Cheers,
Leif

Re: Possible SPIP to improve matrix and vector column type support

Posted by Leif Walsh <le...@gmail.com>.
I agree we should reuse as much as possible. For PySpark, I think the
obvious choices already made (Breeze and numpy arrays) make a lot of sense;
I’m not sure about the other language bindings and would defer to others.

I was under the impression that UDTs were gone and (probably?) not coming
back. Did I miss something and they’re actually going to be better
supported in the future? I think your second point (about treating SQL
support as a separate effort from expanding the linear algebra primitives)
only really holds if we’re getting UDTs back.

You’ve obviously seen more of the history here than I have. Do you have a
sense of why the efforts you mentioned never went anywhere? I don’t think
this is strictly about “mllib local”; it’s more about generic linalg, so
SPARK-19653 feels like the closest to what I’m after, but it looks to me
like that one just fizzled out without much real back and forth.

Does this just need something like a persistent product manager to scope
out the effort, champion it, and push it forward?
-- 
-- 
Cheers,
Leif

Re: Possible SPIP to improve matrix and vector column type support

Posted by Joseph Bradley <jo...@databricks.com>.
Thanks for the thoughts!  We've gone back and forth quite a bit about local
linear algebra support in Spark.  For reference, there have been some
discussions here:
https://issues.apache.org/jira/browse/SPARK-6442
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-19653

Overall, I like the idea of improving linear algebra support, especially
given the rise of Python numerical processing & deep learning.  But some
considerations I'd list include:
* There are great linear algebra libraries out there, and it would be ideal
to reuse those as much as possible.
* SQL support for linear algebra can be a separate effort from expanding
linear algebra primitives.
* It would be valuable to discuss external types as UDTs (which can be
hacked with numpy and scipy types now) vs. adding linear algebra types to
native Spark SQL.
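
To make that last point concrete, the "hack it with the existing UDT plus
numpy" route looks roughly like this today (column names are just
illustrative):

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)], ["v"]
    )

    # The UDT round-trips through Python objects; numpy does the actual math.
    @udf(returnType=VectorUDT())
    def scale(v):
        return Vectors.dense(np.asarray(v.toArray()) * 2.0)

    df.withColumn("scaled", scale("v")).show()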





-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/