Posted to issues@spark.apache.org by "Nick Buroojy (JIRA)" <ji...@apache.org> on 2015/07/18 00:46:05 UTC
[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

    [ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632061#comment-14632061 ] 

Nick Buroojy commented on SPARK-8418:
-------------------------------------

I like this idea a lot, and I think it would solve one of our main performance issues with the ML API.

Our data set has hundreds of string features that we need to convert into binary vectors. We have found the latency overhead of processing the features one at a time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a custom Estimator that vectorizes all string columns with only a couple of passes over the data set, and we saw significant performance gains.
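
To illustrate the idea only: the sketch below is plain Python, not Spark code and not our Estimator. Rows are modeled as dicts, and the names fit_vocabularies and vectorize are hypothetical. It shows how the distinct values of every string column can be collected in one pass over the data, after which each row can be turned into binary vectors.

```python
from collections import defaultdict


def fit_vocabularies(rows, columns):
    """Collect the distinct values of every column in a single pass.

    Returns {column: {value: index}}, with indices assigned in sorted order.
    Hypothetical helper for illustration; not part of any Spark API.
    """
    seen = defaultdict(set)
    for row in rows:  # the only pass over the data
        for col in columns:
            seen[col].add(row[col])
    return {
        col: {value: i for i, value in enumerate(sorted(seen[col]))}
        for col in columns
    }


def vectorize(row, vocabularies):
    """Turn each string value in a row into a binary (one-hot) list."""
    out = {}
    for col, vocab in vocabularies.items():
        vec = [0] * len(vocab)
        vec[vocab[row[col]]] = 1
        out[col] = vec
    return out
```

Fitting the columns one at a time would instead read the rows once per column, which is where the overhead comes from when there are hundreds of columns.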

I suspect that we aren't the only users with many columns, so we would love to fix this issue upstream with some sort of multi-column interface to transformers and estimators.

I suppose we could make do with the Vector or Array interface using the VectorAssembler as described in this ticket; however, I think the cleanest interface for us would be a Map from source column to destination column.
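
A rough sketch of what a Map-based column specification could look like, including the kind of schema validation the ticket mentions. This is plain Python with a schema modeled as a dict of column name to type name; validate_column_map and the "vector" type label are assumptions made for illustration, not an existing or proposed Spark API.

```python
def validate_column_map(schema, col_map):
    """Check a {source column: destination column} map against a schema.

    Returns the output schema, or raises ValueError if a source column is
    missing, a destination already exists, or a destination is repeated.
    """
    missing = [src for src in col_map if src not in schema]
    if missing:
        raise ValueError(f"unknown input columns: {missing}")

    dests = list(col_map.values())
    clashes = sorted(
        d for d in set(dests) if d in schema or dests.count(d) > 1
    )
    if clashes:
        raise ValueError(f"output columns already exist or repeat: {clashes}")

    out = dict(schema)  # leave the caller's schema untouched
    for dest in dests:
        out[dest] = "vector"
    return out
```

With a map, each output column is named explicitly next to its input, so the validation can report exactly which pair is wrong.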

As far as sharing code, there are at least two strategies:
1) Keep the single-value implementation as it is today, and add a multi-value view on top of it. For example, StringVectorizer.setInputCols(Array(A, B)) would return a pipeline of [StringVectorizer.setInputCol(A), StringVectorizer.setInputCol(B)].
2) Reimplement each transformer around a multi-value implementation and make the single-value interface a trivial invocation of the multi-value code. For example, StringVectorizer.setInputCol(A) would invoke StringVectorizer.setInputCols(Array(A)).

The obvious downside of 1 is that it wouldn't address the performance issues we ran into with hundreds of columns. The upsides are minimal implementation effort and simpler code to maintain.

The main downside of 2 is more upfront effort to implement multi-value transformations, but the upside is reasonable performance with "wide" data sets.
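
The trade-off between the two strategies can be sketched in a toy model. This is plain Python rather than Spark code, and every class and function name in it is hypothetical. CountingRows only exists to make the number of passes over the data visible.

```python
class CountingRows:
    """Wraps a list of rows and counts how many times it is iterated."""

    def __init__(self, rows):
        self._rows = rows
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self._rows)


class SingleColumnIndexer:
    """Toy single-column stage: fitting costs one pass over the rows."""

    def __init__(self, input_col):
        self.input_col = input_col

    def fit(self, rows):
        return sorted({row[self.input_col] for row in rows})


def expand_to_pipeline(input_cols):
    """Strategy 1: a multi-column request becomes one stage per column."""
    return [SingleColumnIndexer(col) for col in input_cols]


class MultiColumnIndexer:
    """Strategy 2: the multi-column code is the real implementation."""

    def __init__(self, input_cols):
        self.input_cols = list(input_cols)

    def fit(self, rows):
        seen = {col: set() for col in self.input_cols}
        for row in rows:  # one pass, regardless of the number of columns
            for col in self.input_cols:
                seen[col].add(row[col])
        return {col: sorted(values) for col, values in seen.items()}


def single_via_multi(input_col, rows):
    """Strategy 2's single-column entry point: a trivial multi-column call."""
    return MultiColumnIndexer([input_col]).fit(rows)[input_col]
```

In this model, fitting the stages from expand_to_pipeline iterates the rows once per column, while MultiColumnIndexer.fit iterates them once in total. Both produce the same result, which is why the public interface could stay the same while the implementation changes underneath.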

I don't think 1 and 2 are mutually exclusive. Maybe the multi-value interface could be solidified first using strategy 1; then, over time, the key transformers, like StringVectorizer, could be rewritten using strategy 2?

You mentioned that this would require a short design doc. Can I help with that?

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
>                 Key: SPARK-8418
>                 URL: https://issues.apache.org/jira/browse/SPARK-8418
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
