You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Leif Walsh <le...@gmail.com> on 2018/05/12 22:44:51 UTC

Re: Possible SPIP to improve matrix and vector column type support

I filed an SPIP for this at
https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss!

On Wed, Apr 18, 2018 at 23:33 Leif Walsh <le...@gmail.com> wrote:

> I agree we should reuse as much as possible. For PySpark, I think the
> obvious choices of Breeze and numpy arrays already made make a lot of
> sense, I’m not sure about the other language bindings and would defer to
> others.
>
> I was under the impression that UDTs were gone and (probably?) not coming
> back. Did I miss something and they’re actually going to be better
> supported in the future? I think your second point (about separating
> expanding the primitives from expanding SQL support) is only really true if
> we’re getting UDTs back.
>
> You’ve obviously seen more of the history here than me. Do you have a
> sense of why the efforts you mentioned never went anywhere? I don’t think
> this is strictly about “mllib local”, it’s more about generic linalg, so
> 19653 feels like the closest to what I’m after, but it looks to me like
> that one just fizzled out, rather than a real back and forth.
>
> Does this just need something like a persistent product manager to scope
> out the effort, champion it, and push it forward?
> On Wed, Apr 18, 2018 at 20:02 Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> Thanks for the thoughts!  We've gone back and forth quite a bit about
>> local linear algebra support in Spark.  For reference, there have been some
>> discussions here:
>> https://issues.apache.org/jira/browse/SPARK-6442
>> https://issues.apache.org/jira/browse/SPARK-16365
>> https://issues.apache.org/jira/browse/SPARK-19653
>>
>> Overall, I like the idea of improving linear algebra support, especially
>> given the rise of Python numerical processing & deep learning.  But some
>> considerations I'd list include:
>> * There are great linear algebra libraries out there, and it would be
>> ideal to reuse those as much as possible.
>> * SQL support for linear algebra can be a separate effort from expanding
>> linear algebra primitives.
>> * It would be valuable to discuss external types as UDTs (which can be
>> hacked with numpy and scipy types now) vs. adding linear algebra types to
>> native Spark SQL.
>>
>>
>> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <le...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
>>> I’ve found myself wanting more.
>>>
>>> There is a minor issue in that with the arrow serialization enabled,
>>> these types don’t serialize properly in python UDF calls or in toPandas.
>>> There’s a natural representation for them in numpy.ndarray, and I’ve
>>> started a conversation with the arrow community about supporting
>>> tensor-valued columns, but that might be a ways out. In the meantime, I
>>> think we can fix this by using the FixedSizeBinary column type in arrow,
>>> together with some metadata describing the tensor shape (list of dimension
>>> sizes).
>>>
>>> The larger issue, for which I intend to submit an SPIP soon, is that
>>> these types could be better supported at the API layer, regardless of
>>> serialization. In the limit, we could consider the entire numpy ndarray
>>> surface area as a target. At the minimum, what I’m thinking is that these
>>> types should support column operations like matrix multiply, transpose,
>>> inner and outer product, etc., and maybe have a more ergonomic construction
>>> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
>>> VectorAssembler API is kind of clunky.
>>>
>>> One possibility here is to restrict the tensor column types such that
>>> every value must have the same shape, e.g. a 2x2 matrix. This would allow
>>> for operations to check validity before execution, for example, a matrix
>>> multiply could check dimension match and fail fast. However, there might be
>>> use cases for a column to contain variable shape tensors, I’m open to
>>> discussion here.
>>>
>>> What do you all think?
>>> --
>>> --
>>> Cheers,
>>> Leif
>>>
>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
> --
> --
> Cheers,
> Leif
>
-- 
-- 
Cheers,
Leif