Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2019/12/17 11:12:00 UTC

[jira] [Commented] (SPARK-30286) Some thoughts on new features for MLLIB

    [ https://issues.apache.org/jira/browse/SPARK-30286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998096#comment-16998096 ] 

zhengruifeng commented on SPARK-30286:
--------------------------------------

It seems that the last roadmap for MLlib was for 2.0, and that the community has not discussed the future of MLlib for a long time.

The above is what I have been thinking about for some time. Of these, I am inclined to add three to ML: 1, *mini-batch KMeans*; 2, *vector validator*; 3, *Vectors enhancement*.
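
To make the vector validator idea (item 3.1 in the description) more concrete, here is a minimal sketch of the checks it could perform. It is plain Python rather than a Spark UnaryTransformer, and the names validate_vector, validate_column, num_features and non_negative are hypothetical, chosen only for illustration.

```python
import math
from typing import Iterable, Optional, Sequence


def validate_vector(values: Sequence[float],
                    num_features: Optional[int] = None,
                    non_negative: bool = False) -> None:
    """Raise ValueError if the vector violates a requested requirement."""
    # Optional size check, analogous to the numFeatures requirement.
    if num_features is not None and len(values) != num_features:
        raise ValueError(f"expected {num_features} features, got {len(values)}")
    for i, v in enumerate(values):
        # NaN is always rejected in this sketch.
        if math.isnan(v):
            raise ValueError(f"NaN at index {i}")
        if non_negative and v < 0:
            raise ValueError(f"negative value {v} at index {i}")


def validate_column(rows: Iterable[Sequence[float]], **checks) -> int:
    """Validate every vector in a column; return the number of rows checked."""
    count = 0
    for row in rows:
        validate_vector(row, **checks)
        count += 1
    return count
```

Placed before a scaler or an estimator, a check like this would make the pipeline fail on the first invalid row instead of silently producing a model with NaN values.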

Friendly ping [~srowen] [~viirya]: what do you think of this? Thanks.

 

> Some thoughts on new features for MLLIB
> ---------------------------------------
>
>                 Key: SPARK-30286
>                 URL: https://issues.apache.org/jira/browse/SPARK-30286
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Minor
>
> Some thoughts on new features for ML:
> 1, clustering: *mini-batch KMeans*: KMeans may be one of the most widely used algorithms in MLlib, and mini-batch KMeans is much faster than KMeans with [comparable results|https://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#sphx-glr-auto-examples-cluster-plot-mini-batch-kmeans-py]; in scikit-learn it is a separate estimator, while in MLlib we could add it as one or two params on the existing KMeans.
> 2, classification & regression:
>  2.1 ExtraTrees (Extremely Randomized Trees): an even more randomized version of a tree ensemble; it has lower variance than its sibling RandomForest, and it seems to be used more and more in online contests. It seems that it can be easily implemented atop the existing ensemble implementations;
>  2.2 Categorical Naive Bayes: a new NB variant just released in scikit-learn 0.22; it should be easy to implement as a new modelType in MLlib's NB;
> 3, features:
>  3.1 *vector validator*: a new UnaryTransformer that checks whether a vector column meets some requirements, such as non-NaN, non-negative, positive, all values binary/int, all vectors dense/sparse, or numFeatures. Currently some implementations handle invalid values, but most do not. For example, we first scale the input with MinMaxScaler; however, MinMaxScaler ignores NaN in training and keeps NaN in transformation. The scaled dataset is then fed into LinearRegression, and in the end I obtain a LinearRegressionModel with NaN coefficients. Across the whole pipeline, no exception is thrown. With this validator, the pipeline can fail early.
>  3.2 inverse transform for models/transformers: we may add a new bool param HasInverseTransform;
>  3.3 non-linear transformation: quantile transforms and power transforms (including the well-known Box-Cox method), which map data from any distribution to be as close as possible to another distribution (mostly Gaussian); _I am working on this, since I needed this feature recently_;
>  3.4 similarity search: in my experience, Approximate Nearest Neighbors based on KMeans provides more accurate results than LSH; could we follow well-known libraries like Facebook-FAISS to implement a new ANN?
> 4, warm start: initialize the model from a previous model; ONLY the coefficients are used (the params of the previous model are ignored). Perhaps a new string param HasInitialModelPath could be added as a first step.
> 5, linalg: *Vectors support more methods, like:* *iterator,* *activeIterator, nonZeroIterator*; so that we can implement some methods based on Iterator[(Int, Double)] instead of ml.Vector/mllib.Vector, and reuse them on both sides without vector conversions.
> 6, parameter server: there have been several tickets for it. It should be very useful and would provide an efficient gradient-based solver for many algorithms. I also know there have been some efforts to implement it atop Spark, like Tencent-Angel & [Glint|https://github.com/Angel-ML/angel]
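
As a rough illustration of item 1, the sketch below shows one common form of the mini-batch KMeans update: sample a small batch, assign each point to its nearest center, then move that center toward the point with a per-center step size of 1/count. It is a single-machine sketch in plain Python, not a proposal for how MLlib's KMeans would implement it; the function and parameter names (mini_batch_kmeans, batch_size, iterations, seed) are assumptions made for this example.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(centers: List[List[float]], p: Sequence[float]) -> int:
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda j: squared_distance(centers[j], p))


def mini_batch_kmeans(points: Sequence[Sequence[float]], k: int,
                      batch_size: int, iterations: int,
                      seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    data = [list(p) for p in points]
    # Initialize the centers with k distinct rows of the input.
    centers = [list(p) for p in rng.sample(data, k)]
    counts = [0] * k  # number of points assigned to each center so far
    for _ in range(iterations):
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Assign the whole batch first, then update the centers.
        assignments = [closest(centers, p) for p in batch]
        for p, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [c + eta * (x - c) for c, x in zip(centers[j], p)]
    return centers
```

Each iteration touches only batch_size points instead of the full dataset, which is where the speed-up over standard KMeans comes from. If this were folded into the existing KMeans as suggested, something like the batch size (or a sampling fraction) would be the natural extra param.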



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
