
Posted to dev@spark.apache.org by Yu Ishikawa <yu...@gmail.com> on 2014/10/08 13:19:47 UTC

Standardized Distance Functions in MLlib

Hi all, 

As far as I understand MLlib, it would be a good idea to support various
distance functions in some machine learning algorithms. For example,
KMeans currently supports only the Euclidean distance metric. I am also
working on contributing hierarchical clustering to MLlib
(https://issues.apache.org/jira/browse/SPARK-2429), and I would like it to
support various distance functions.

Should we support standardized distance functions in MLlib or not?
As you know, Spark depends on Breeze, so I think there are two approaches to
using distance functions in MLlib. One is implementing the distance
functions in MLlib itself. The other is wrapping the Breeze functions. I am
a bit worried about using Breeze directly in Spark because, for example, we
can't fully control Breeze releases.

I sent a PR earlier, but it has stalled. I'd like to hear the community's
thoughts on it.
https://github.com/apache/spark/pull/1964#issuecomment-54953348

Best,



-----
-- Yu Ishikawa
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Standardized Distance Functions in MLlib

Posted by Yu Ishikawa <yu...@gmail.com>.

Hi Xiangrui, 

Thank you very much for replying and letting me know that you upgraded
breeze to 0.10 yesterday.
Sorry that I didn't know that.

> We don't want to maintain 
> another copy of the implementation in MLlib to keep the maintenance 
> cost low. Both spark and breeze are open-source projects. We should 
> try our best to avoid duplicate effort and forking, even though we 
> don't have control over the release of breeze. 

I got it. I agree with keeping linear algebra in MLlib lightweight.

> It would be really nice if you can help review it and discuss how to embed
> distance measures there. 

All right. I will check it.

thanks,
Yu Ishikawa




Re: Standardized Distance Functions in MLlib

Posted by Xiangrui Meng <me...@gmail.com>.

Hi Yu,

We upgraded breeze to 0.10 yesterday. So we can call the distance
functions you contributed to breeze easily. We don't want to maintain
another copy of the implementation in MLlib to keep the maintenance
cost low. Both spark and breeze are open-source projects. We should
try our best to avoid duplicate effort and forking, even though we
don't have control over the release of breeze.

As we discussed in the PR, if we want users to call them directly,
they should live in breeze. If we want users to specify them in
clustering algorithms, we should hide the implementation from users.
So simple wrappers over the breeze implementation should be
sufficient. We are reviewing

https://github.com/apache/spark/pull/2634

and trying to see how we can embed distance measures there. In the
k-means implementation, we don't use (Vector, Vector) => Double.
Instead, we cache the norms and use inner product to derive the
distance, which is faster and takes advantage of sparsity. It would be
really nice if you can help review it and discuss how to embed
distance measures there. Thanks!
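
To illustrate the norm-caching idea described above, here is a minimal
sketch in plain Python. It is not the MLlib or breeze code. The dict-based
sparse vectors and the names sparse_dot, norm, fast_squared_distance and
closest_center are assumptions made up for this example. It relies on the
identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>, so each norm is
computed once and only the inner product, which touches just the shared
non-zero entries, is recomputed per pair.

```python
import math


def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller vector so the cost follows the sparser side.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[index] for index, value in a.items() if index in b)


def norm(v):
    """Euclidean norm of a sparse vector."""
    return math.sqrt(sum(x * x for x in v.values()))


def fast_squared_distance(a, norm_a, b, norm_b):
    """Squared Euclidean distance derived from cached norms.

    Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>.
    """
    squared = norm_a * norm_a + norm_b * norm_b - 2.0 * sparse_dot(a, b)
    # Rounding can push the result slightly below zero for near-identical
    # vectors, so clamp it. This sketch does nothing more about precision.
    return max(squared, 0.0)


def closest_center(point, point_norm, centers, center_norms):
    """Index of the nearest center, reusing the cached norms."""
    distances = [
        fast_squared_distance(point, point_norm, c, n)
        for c, n in zip(centers, center_norms)
    ]
    return distances.index(min(distances))


# Norms are computed once and then reused for every comparison.
points = [{0: 1.0, 3: 2.0}, {1: 4.0}, {0: 1.0, 3: 2.5}]
point_norms = [norm(p) for p in points]
centers = [{0: 1.0, 3: 2.0}, {1: 3.0}]
center_norms = [norm(c) for c in centers]

assignments = [
    closest_center(p, n, centers, center_norms)
    for p, n in zip(points, point_norms)
]
```

Note that this formulation is specific to Euclidean distance. A generic
(Vector, Vector) => Double function cannot reuse cached norms this way,
which is one reason embedding other distance measures needs discussion.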

Best,
Xiangrui
