Posted to dev@spark.apache.org by RJ Nowling <rn...@gmail.com> on 2014/07/08 19:54:00 UTC

Contributing to MLlib: Proposal for Clustering Algorithms

Hi all,

MLlib currently has one clustering algorithm implementation, KMeans.
It would benefit from having implementations of other clustering
algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
Clustering, and Affinity Propagation.

I recently submitted a PR [1] for a MiniBatch KMeans implementation,
and I saw an email on this list about interest in implementing Fuzzy
C-Means.

Based on Sean Owen's review of my MiniBatch KMeans code, it became
apparent that before I implement more clustering algorithms, it would
be useful to hammer out a framework to reduce code duplication and
implement a consistent API.

I'd like to gauge the interest and goals of the MLlib community:

1. Are you interested in having more clustering algorithms available?

2. Is the community interested in specifying a common framework?

Thanks!
RJ

[1] - https://github.com/apache/spark/pull/1248
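
To make the "common framework" idea concrete, here is the kind of
shape I have in mind -- just a rough sketch with hypothetical names,
not a proposed final API:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Vector

    // Illustrative only: every clustering implementation would share
    // a training entry point and a common model interface.
    trait ClusteringAlgorithm {
      def run(data: RDD[Vector]): ClusteringModel
    }

    trait ClusteringModel {
      def k: Int                       // number of clusters found
      def predict(point: Vector): Int  // index of the closest cluster
    }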


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Sandy Ryza <sa...@cloudera.com>.
Having a common framework for clustering makes sense to me.  While we
should be careful about what algorithms we include, having solid
implementations of minibatch clustering and hierarchical clustering seems
like a worthwhile goal, and we should reuse as much of the code and
APIs as is reasonable.


On Tue, Jul 8, 2014 at 1:19 PM, RJ Nowling <rn...@gmail.com> wrote:

> Thanks, Hector! Your feedback is useful.
>
> On Tuesday, July 8, 2014, Hector Yee <he...@gmail.com> wrote:
>
> > I would say for bigdata applications the most useful would be
> hierarchical
> > k-means with back tracking and the ability to support k nearest
> centroids.
> >
> >
> > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > It would benefit from having implementations of other clustering
> > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > Clustering, and Affinity Propagation.
> > >
> > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > and I saw an email on this list about interest in implementing Fuzzy
> > > C-Means.
> > >
> > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > apparent that before I implement more clustering algorithms, it would
> > > be useful to hammer out a framework to reduce code duplication and
> > > implement a consistent API.
> > >
> > > I'd like to gauge the interest and goals of the MLlib community:
> > >
> > > 1. Are you interested in having more clustering algorithms available?
> > >
> > > 2. Is the community interested in specifying a common framework?
> > >
> > > Thanks!
> > > RJ
> > >
> > > [1] - https://github.com/apache/spark/pull/1248
> > >
> > >
> > > --
> > > em rnowling@gmail.com
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>
> > *google.com/+HectorYee <http://google.com/+HectorYee>*
> >
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Thanks, Hector! Your feedback is useful.

On Tuesday, July 8, 2014, Hector Yee <he...@gmail.com> wrote:

> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com> wrote:
>
> > Hi all,
> >
> > MLlib currently has one clustering algorithm implementation, KMeans.
> > It would benefit from having implementations of other clustering
> > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > Clustering, and Affinity Propagation.
> >
> > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > and I saw an email on this list about interest in implementing Fuzzy
> > C-Means.
> >
> > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > apparent that before I implement more clustering algorithms, it would
> > be useful to hammer out a framework to reduce code duplication and
> > implement a consistent API.
> >
> > I'd like to gauge the interest and goals of the MLlib community:
> >
> > 1. Are you interested in having more clustering algorithms available?
> >
> > 2. Is the community interested in specifying a common framework?
> >
> > Thanks!
> > RJ
> >
> > [1] - https://github.com/apache/spark/pull/1248
> >
> >
> > --
> > em rnowling@gmail.com
> > c 954.496.2314
> >
>
>
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*
>


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Jeremy Freeman <fr...@gmail.com>.
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put
together.

-- Jeremy




Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Thanks, Jeremy.  I'm abandoning my initial approach, and I'll work on
optimizing your example (so it doesn't do the breeze-vector conversions
every time KMeans is called).  I need to finish a few other projects first,
though, so it may be a couple weeks.
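
To sketch the kind of optimization I mean (the hypothetical helper
below is illustrative; the real change would reuse MLlib's internal
conversions):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Convert to breeze vectors once up front and cache the result,
    // instead of re-converting the data inside every KMeans call.
    def toBreezeCached(data: RDD[Vector]): RDD[BDV[Double]] =
      data.map(v => new BDV(v.toArray)).cache()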

In the meantime, Yu also created a JIRA for a hierarchical KMeans
implementation.  I pointed him to your example and a couple papers I found.

If you or Yu beat me to getting an implementation in, I'd be happy to
review it.  :)


On Wed, Aug 27, 2014 at 12:18 PM, Jeremy Freeman <fr...@gmail.com>
wrote:

> Hey RJ,
>
> Sorry for the delay, I'd be happy to take a look at this if you can post
> the code!
>
> I think splitting the largest cluster in each round is fairly common, but
> ideally it would be an option to do it one way or the other.
>
> -- Jeremy
>
> ---------------------
> jeremy freeman, phd
> neuroscientist
> @thefreemanlab
>
> On Aug 12, 2014, at 2:20 PM, RJ Nowling <rn...@gmail.com> wrote:
>
> Hi all,
>
> I wanted to follow up.
>
> I have a prototype for an optimized version of hierarchical k-means.  I
> wanted to get some feedback on my approach.
>
> Jeremy's implementation splits the largest cluster in each round.  Is it
> better to do it that way or to split each cluster in half?
>
> Are there any open-source examples that are widely used in
> production?
>
> Thanks!
>
>
>
> On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rn...@gmail.com> wrote:
>
> Nice to meet you, Jeremy!
>
> This is great!  Hierarchical clustering was next on my list --
> currently trying to get my PR for MiniBatch KMeans accepted.
>
> If it's cool with you, I'll try converting your code to fit in with
> the existing MLLib code as you suggest. I also need to review the
> Decision Tree code (as suggested above) to see how much of that can be
> reused.
>
> Maybe I can ask you to do a code review for me when I'm done?
>
>
>
>
>
> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
> <fr...@gmail.com> wrote:
>
> Hi all,
>
> Cool discussion! I agree that a more standardized API for clustering, and
> easy access to underlying routines, would be useful (we've also been
> discussing this when trying to develop streaming clustering algorithms,
> similar to https://github.com/apache/spark/pull/1361)
>
> For divisive, hierarchical clustering I implemented something awhile back,
> here's a gist.
>
> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>
> It does bisecting k-means clustering (with k=2), with a recursive class for
> keeping track of the tree. I also found this much better than agglomerative
> methods (for the reasons Hector points out).
>
> This needs to be cleaned up, and can surely be optimized (esp. by replacing
> the core KMeans step with existing MLLib code), but I can say I was running
> it successfully on quite large data sets.
>
> RJ, depending on where you are in your progress, I'd be happy to help work
> on this piece and / or have you use this as a jumping off point, if useful.
>
> -- Jeremy
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>
>
>


-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Jeremy Freeman <fr...@gmail.com>.
Hey RJ,

Sorry for the delay, I'd be happy to take a look at this if you can post the code!

I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other.
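
Roughly what I mean by making it an option (hypothetical names, local
sketch only):

    // Which leaves get bisected next round: the single largest
    // remaining cluster, or every current leaf at once.
    def leavesToSplit(sizes: Map[Int, Long],
                      largestFirst: Boolean): Seq[Int] =
      if (largestFirst) Seq(sizes.maxBy(_._2)._1)
      else sizes.keys.toSeq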

-- Jeremy

---------------------
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Aug 12, 2014, at 2:20 PM, RJ Nowling <rn...@gmail.com> wrote:

> Hi all,
> 
> I wanted to follow up.
> 
> I have a prototype for an optimized version of hierarchical k-means.  I
> wanted to get some feedback on my approach.
> 
> Jeremy's implementation splits the largest cluster in each round.  Is it
> better to do it that way or to split each cluster in half?
> 
> Are there any open-source examples that are widely used in
> production?
> 
> Thanks!
> 
> 
> 
> On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rn...@gmail.com> wrote:
> 
>> Nice to meet you, Jeremy!
>> 
>> This is great!  Hierarchical clustering was next on my list --
>> currently trying to get my PR for MiniBatch KMeans accepted.
>> 
>> If it's cool with you, I'll try converting your code to fit in with
>> the existing MLLib code as you suggest. I also need to review the
>> Decision Tree code (as suggested above) to see how much of that can be
>> reused.
>> 
>> Maybe I can ask you to do a code review for me when I'm done?
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
>> <fr...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> Cool discussion! I agree that a more standardized API for clustering, and
>>> easy access to underlying routines, would be useful (we've also been
>>> discussing this when trying to develop streaming clustering algorithms,
>>> similar to https://github.com/apache/spark/pull/1361)
>>> 
>>> For divisive, hierarchical clustering I implemented something awhile
>> back,
>>> here's a gist.
>>> 
>>> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>>> 
>>> It does bisecting k-means clustering (with k=2), with a recursive class
>> for
>>> keeping track of the tree. I also found this much better than
>> agglomerative
>>> methods (for the reasons Hector points out).
>>> 
>>> This needs to be cleaned up, and can surely be optimized (esp. by
>> replacing
>>> the core KMeans step with existing MLLib code), but I can say I was
>> running
>>> it successfully on quite large data sets.
>>> 
>>> RJ, depending on where you are in your progress, I'd be happy to help
>> work
>>> on this piece and / or have you use this as a jumping off point, if
>> useful.
>>> 
>>> -- Jeremy
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> em rnowling@gmail.com
>> c 954.496.2314
>> 
> 
> 
> 
> -- 
> em rnowling@gmail.com
> c 954.496.2314


Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Hi all,

I wanted to follow up.

I have a prototype for an optimized version of hierarchical k-means.  I
wanted to get some feedback on my approach.

Jeremy's implementation splits the largest cluster in each round.  Is it
better to do it that way or to split each cluster in half?

Are there any open-source examples that are widely used in
production?

Thanks!



On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling <rn...@gmail.com> wrote:

> Nice to meet you, Jeremy!
>
> This is great!  Hierarchical clustering was next on my list --
> currently trying to get my PR for MiniBatch KMeans accepted.
>
> If it's cool with you, I'll try converting your code to fit in with
> the existing MLLib code as you suggest. I also need to review the
> Decision Tree code (as suggested above) to see how much of that can be
> reused.
>
> Maybe I can ask you to do a code review for me when I'm done?
>
>
>
>
>
> On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
> <fr...@gmail.com> wrote:
> > Hi all,
> >
> > Cool discussion! I agree that a more standardized API for clustering, and
> > easy access to underlying routines, would be useful (we've also been
> > discussing this when trying to develop streaming clustering algorithms,
> > similar to https://github.com/apache/spark/pull/1361)
> >
> > For divisive, hierarchical clustering I implemented something awhile
> back,
> > here's a gist.
> >
> > https://gist.github.com/freeman-lab/5947e7c53b368fe90371
> >
> > It does bisecting k-means clustering (with k=2), with a recursive class
> for
> > keeping track of the tree. I also found this much better than
> agglomerative
> > methods (for the reasons Hector points out).
> >
> > This needs to be cleaned up, and can surely be optimized (esp. by
> replacing
> > the core KMeans step with existing MLLib code), but I can say I was
> running
> > it successfully on quite large data sets.
> >
> > RJ, depending on where you are in your progress, I'd be happy to help
> work
> > on this piece and / or have you use this as a jumping off point, if
> useful.
> >
> > -- Jeremy
> >
> >
> >
>
>
>
> --
> em rnowling@gmail.com
> c 954.496.2314
>



-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Nice to meet you, Jeremy!

This is great!  Hierarchical clustering was next on my list --
currently trying to get my PR for MiniBatch KMeans accepted.

If it's cool with you, I'll try converting your code to fit in with
the existing MLLib code as you suggest. I also need to review the
Decision Tree code (as suggested above) to see how much of that can be
reused.

Maybe I can ask you to do a code review for me when I'm done?





On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
<fr...@gmail.com> wrote:
> Hi all,
>
> Cool discussion! I agree that a more standardized API for clustering, and
> easy access to underlying routines, would be useful (we've also been
> discussing this when trying to develop streaming clustering algorithms,
> similar to https://github.com/apache/spark/pull/1361)
>
> For divisive, hierarchical clustering I implemented something awhile back,
> here's a gist.
>
> https://gist.github.com/freeman-lab/5947e7c53b368fe90371
>
> It does bisecting k-means clustering (with k=2), with a recursive class for
> keeping track of the tree. I also found this much better than agglomerative
> methods (for the reasons Hector points out).
>
> This needs to be cleaned up, and can surely be optimized (esp. by replacing
> the core KMeans step with existing MLLib code), but I can say I was running
> it successfully on quite large data sets.
>
> RJ, depending on where you are in your progress, I'd be happy to help work
> on this piece and / or have you use this as a jumping off point, if useful.
>
> -- Jeremy
>
>
>



-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Jeremy Freeman <fr...@gmail.com>.
Hi all, 

Cool discussion! I agree that a more standardized API for clustering, and
easy access to underlying routines, would be useful (we've also been
discussing this when trying to develop streaming clustering algorithms,
similar to https://github.com/apache/spark/pull/1361) 

For divisive, hierarchical clustering I implemented something awhile back,
here's a gist. 

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for
keeping track of the tree. I also found this much better than agglomerative
methods (for the reasons Hector points out).
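
In spirit, the recursion looks something like the sketch below. This is
not the gist itself: the tree classes are hypothetical, and it leans on
MLlib's existing KMeans for the core step:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    sealed trait ClusterTree
    case class Leaf(points: RDD[Vector]) extends ClusterTree
    case class Node(left: ClusterTree, right: ClusterTree) extends ClusterTree

    def bisect(points: RDD[Vector], depth: Int): ClusterTree =
      if (depth == 0 || points.count() < 2) Leaf(points)
      else {
        // Split the current cluster in two with plain k-means, k = 2.
        val model = KMeans.train(points, 2, 20)
        val left  = points.filter(p => model.predict(p) == 0)
        val right = points.filter(p => model.predict(p) == 1)
        Node(bisect(left, depth - 1), bisect(right, depth - 1))
      }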

This needs to be cleaned up, and can surely be optimized (esp. by replacing
the core KMeans step with existing MLLib code), but I can say I was running
it successfully on quite large data sets. 

RJ, depending on where you are in your progress, I'd be happy to help work
on this piece and / or have you use this as a jumping off point, if useful. 

-- Jeremy 




Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Nick Pentreath <ni...@gmail.com>.
Might be worth checking out scikit-learn and Mahout to get some broad ideas.
—
Sent from Mailbox

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling <rn...@gmail.com> wrote:

> I went ahead and created JIRAs.
> JIRA for Hierarchical Clustering:
> https://issues.apache.org/jira/browse/SPARK-2429
> JIRA for Standardized Clustering APIs:
> https://issues.apache.org/jira/browse/SPARK-2430
> Before submitting a PR for the standardized API, I want to implement a
> few clustering algorithms for myself to get a good "feel" for how to
> structure such an API.  If others with more experience want to dive
> into designing the API, though, that would allow us to get moving more
> quickly.
> On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath <ni...@gmail.com> wrote:
>> Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.
>>
>>
>> I'd say it would make most sense to submit a PR for the Standardised API first, agree on it with everyone, and then build on it for the specific implementations.
>> —
>> Sent from Mailbox
>>
>> On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rn...@gmail.com> wrote:
>>
>>> Thanks everyone for the input.
>>> So it seems what people want is:
>>> * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
>>> conquer approach, look at DecisionTree implementation as a reference)
>>> * Restructure the 3 KMeans clustering algorithm implementations to prevent
>>> code duplication and conform to a consistent API where possible
>>> If this is correct, I'll start work on that.  How would it be best to
>>> structure it? Should I submit separate JIRAs / PRs for refactoring of
>>> current code, MiniBatch KMeans, and Hierarchical or keep my current
>>> JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
>>> for Hierarchical KMeans that builds on top of that?
>>> Thanks!
>>> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <he...@gmail.com> wrote:
>>>> Yeah if one were to replace the objective function in decision tree with
>>>> minimizing the variance of the leaf nodes it would be a hierarchical
>>>> clusterer.
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <ev...@gmail.com>
>>>> wrote:
>>>>
>>>>> If you're thinking along these lines, have a look at the DecisionTree
>>>>> implementation in MLlib. It uses the same idea and is optimized to prevent
>>>>> multiple passes over the data by computing several splits at each level of
>>>>> tree building. The tradeoff is increased model state and computation per
>>>>> pass over the data, but fewer total passes and hopefully lower
>>>>> communication overheads than, say, shuffling data around that belongs to
>>>>> one cluster or another. Something like that could work here as well.
>>>>>
>>>>> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
>>>>> efficient way to implement it, though.
>>>>>
>>>>>
>>>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <he...@gmail.com> wrote:
>>>>>
>>>>> > No was thinking more top-down:
>>>>> >
>>>>> > assuming a distributed kmeans system already existing, recursively apply
>>>>> > the kmeans algorithm on data already partitioned by the previous level of
>>>>> > kmeans.
>>>>> >
>>>>> > I haven't been much of a fan of bottom up approaches like HAC mainly
>>>>> > because they assume there is already a distance metric for items to
>>>>> items.
>>>>> > This makes it hard to cluster new content. The distances between sibling
>>>>> > clusters is also hard to compute (if you have thrown away the similarity
>>>>> > matrix), do you count paths to same parent node if you are computing
>>>>> > distances between items in two adjacent nodes for example. It is also a
>>>>> bit
>>>>> > harder to distribute the computation for bottom up approaches as you have
>>>>> > to already find the nearest neighbor to an item to begin the process.
>>>>> >
>>>>> >
>>>>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>>> >
>>>>> > > The scikit-learn implementation may be of interest:
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> >
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>>> > >
>>>>> > > It's a bottom up approach.  The pair of clusters for merging are
>>>>> > > chosen to minimize variance.
>>>>> > >
>>>>> > > Their code is under a BSD license so it can be used as a template.
>>>>> > >
>>>>> > > Is something like that you were thinking Hector?
>>>>> > >
>>>>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>>> > > wrote:
>>>>> > > > sure. more interesting problem here is choosing k at each level.
>>>>> Kernel
>>>>> > > > methods seem to be most promising.
>>>>> > > >
>>>>> > > >
>>>>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <he...@gmail.com>
>>>>> > wrote:
>>>>> > > >
>>>>> > > >> No idea, never looked it up. Always just implemented it as doing
>>>>> > k-means
>>>>> > > >> again on each cluster.
>>>>> > > >>
>>>>> > > >> FWIW standard k-means with euclidean distance has problems too with
>>>>> > some
>>>>> > > >> dimensionality reduction methods. Swapping out the distance metric
>>>>> > with
>>>>> > > >> negative dot or cosine may help.
>>>>> > > >>
>>>>> > > >> Other more useful clustering would be hierarchical SVD. The reason
>>>>> > why I
>>>>> > > >> like hierarchical clustering is it makes for faster inference
>>>>> > especially
>>>>> > > >> over billions of users.
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>>> >
>>>>> > > >> wrote:
>>>>> > > >>
>>>>> > > >> > Hector, could you share the references for hierarchical K-means?
>>>>> > > thanks.
>>>>> > > >> >
>>>>> > > >> >
>>>>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <he...@gmail.com>
>>>>> > > wrote:
>>>>> > > >> >
>>>>> > > >> > > I would say for bigdata applications the most useful would be
>>>>> > > >> > hierarchical
>>>>> > > >> > > k-means with back tracking and the ability to support k nearest
>>>>> > > >> > centroids.
>>>>> > > >> > >
>>>>> > > >> > >
>>>>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
>>>>> >
>>>>> > > >> wrote:
>>>>> > > >> > >
>>>>> > > >> > > > Hi all,
>>>>> > > >> > > >
>>>>> > > >> > > > MLlib currently has one clustering algorithm implementation,
>>>>> > > KMeans.
>>>>> > > >> > > > It would benefit from having implementations of other
>>>>> clustering
>>>>> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>>>>> Hierarchical
>>>>> > > >> > > > Clustering, and Affinity Propagation.
>>>>> > > >> > > >
>>>>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>>>>> > > implementation,
>>>>> > > >> > > > and I saw an email on this list about interest in implementing
>>>>> > > Fuzzy
>>>>> > > >> > > > C-Means.
>>>>> > > >> > > >
>>>>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>>> > became
>>>>> > > >> > > > apparent that before I implement more clustering algorithms,
>>>>> it
>>>>> > > would
>>>>> > > >> > > > be useful to hammer out a framework to reduce code duplication
>>>>> > and
>>>>> > > >> > > > implement a consistent API.
>>>>> > > >> > > >
>>>>> > > >> > > > I'd like to gauge the interest and goals of the MLlib
>>>>> community:
>>>>> > > >> > > >
>>>>> > > >> > > > 1. Are you interested in having more clustering algorithms
>>>>> > > available?
>>>>> > > >> > > >
>>>>> > > >> > > > 2. Is the community interested in specifying a common
>>>>> framework?
>>>>> > > >> > > >
>>>>> > > >> > > > Thanks!
>>>>> > > >> > > > RJ
>>>>> > > >> > > >
>>>>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
>>>>> > > >> > > >
>>>>> > > >> > > >
>>>>> > > >> > > > --
>>>>> > > >> > > > em rnowling@gmail.com
>>>>> > > >> > > > c 954.496.2314
>>>>> > > >> > > >
>>>>> > > >> > >
>>>>> > > >> > >
>>>>> > > >> > >
>>>>> > > >> > > --
>>>>> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>>>>> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>>> > > >> > >
>>>>> > > >> >
>>>>> > > >>
>>>>> > > >>
>>>>> > > >>
>>>>> > > >> --
>>>>> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
>>>>> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>>> > > >>
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > --
>>>>> > > em rnowling@gmail.com
>>>>> > > c 954.496.2314
>>>>> > >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Yee Yang Li Hector <http://google.com/+HectorYee>
>>>>> > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>>> >
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yee Yang Li Hector <http://google.com/+HectorYee>
>>>> *google.com/+HectorYee <http://google.com/+HectorYee>*
>>> --
>>> em rnowling@gmail.com
>>> c 954.496.2314
> -- 
> em rnowling@gmail.com
> c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering:
https://issues.apache.org/jira/browse/SPARK-2429

JIRA for Standardized Clustering APIs:
https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a
few clustering algorithms for myself to get a good "feel" for how to
structure such an API.  If others with more experience want to dive
into designing the API, though, that would allow us to get moving more
quickly.


On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath <ni...@gmail.com> wrote:
> Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.
>
>
> I'd say it would make most sense to submit a PR for the Standardised API first, agree on it with everyone, and then build on it for the specific implementations.
> —
> Sent from Mailbox
>
> On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rn...@gmail.com> wrote:
>
>> Thanks everyone for the input.
>> So it seems what people want is:
>> * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
>> conquer approach, look at DecisionTree implementation as a reference)
>> * Restructure the 3 KMeans clustering algorithm implementations to prevent
>> code duplication and conform to a consistent API where possible
>> If this is correct, I'll start work on that.  How would it be best to
>> structure it? Should I submit separate JIRAs / PRs for refactoring of
>> current code, MiniBatch KMeans, and Hierarchical or keep my current
>> JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
>> for Hierarchical KMeans that builds on top of that?
>> Thanks!
>> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <he...@gmail.com> wrote:
>>> Yeah if one were to replace the objective function in decision tree with
>>> minimizing the variance of the leaf nodes it would be a hierarchical
>>> clusterer.
>>>
>>>
>>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <ev...@gmail.com>
>>> wrote:
>>>
>>>> If you're thinking along these lines, have a look at the DecisionTree
>>>> implementation in MLlib. It uses the same idea and is optimized to prevent
>>>> multiple passes over the data by computing several splits at each level of
>>>> tree building. The tradeoff is increased model state and computation per
>>>> pass over the data, but fewer total passes and hopefully lower
>>>> communication overheads than, say, shuffling data around that belongs to
>>>> one cluster or another. Something like that could work here as well.
>>>>
>>>> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
>>>> efficient way to implement it, though.
>>>>
>>>>
>>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <he...@gmail.com> wrote:
>>>>
>>>> > No was thinking more top-down:
>>>> >
>>>> > assuming a distributed kmeans system already existing, recursively apply
>>>> > the kmeans algorithm on data already partitioned by the previous level of
>>>> > kmeans.
>>>> >
>>>> > I haven't been much of a fan of bottom up approaches like HAC mainly
>>>> > because they assume there is already a distance metric for items to
>>>> items.
>>>> > This makes it hard to cluster new content. The distances between sibling
>>>> > clusters is also hard to compute (if you have thrown away the similarity
>>>> > matrix), do you count paths to same parent node if you are computing
>>>> > distances between items in two adjacent nodes for example. It is also a
>>>> bit
>>>> > harder to distribute the computation for bottom up approaches as you have
>>>> > to already find the nearest neighbor to an item to begin the process.
>>>> >
>>>> >
>>>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>> >
>>>> > > The scikit-learn implementation may be of interest:
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>>> > >
>>>> > > It's a bottom up approach.  The pair of clusters for merging are
>>>> > > chosen to minimize variance.
>>>> > >
>>>> > > Their code is under a BSD license so it can be used as a template.
>>>> > >
>>>> > > Is something like that you were thinking Hector?
>>>> > >
>>>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>>> > > wrote:
>>>> > > > sure. more interesting problem here is choosing k at each level.
>>>> Kernel
>>>> > > > methods seem to be most promising.
>>>> > > >
>>>> > > >
>>>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <he...@gmail.com>
>>>> > wrote:
>>>> > > >
>>>> > > >> No idea, never looked it up. Always just implemented it as doing
>>>> > k-means
>>>> > > >> again on each cluster.
>>>> > > >>
>>>> > > >> FWIW standard k-means with euclidean distance has problems too with
>>>> > some
>>>> > > >> dimensionality reduction methods. Swapping out the distance metric
>>>> > with
>>>> > > >> negative dot or cosine may help.
>>>> > > >>
>>>> > > >> Other more useful clustering would be hierarchical SVD. The reason
>>>> > why I
>>>> > > >> like hierarchical clustering is it makes for faster inference
>>>> > especially
>>>> > > >> over billions of users.
>>>> > > >>
>>>> > > >>
>>>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>>> >
>>>> > > >> wrote:
>>>> > > >>
>>>> > > >> > Hector, could you share the references for hierarchical K-means?
>>>> > > thanks.
>>>> > > >> >
>>>> > > >> >
>>>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <he...@gmail.com>
>>>> > > wrote:
>>>> > > >> >
>>>> > > >> > > I would say for bigdata applications the most useful would be
>>>> > > >> > hierarchical
>>>> > > >> > > k-means with back tracking and the ability to support k nearest
>>>> > > >> > centroids.
>>>> > > >> > >
>>>> > > >> > >
>>>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
>>>> >
>>>> > > >> wrote:
>>>> > > >> > >
>>>> > > >> > > > Hi all,
>>>> > > >> > > >
>>>> > > >> > > > MLlib currently has one clustering algorithm implementation,
>>>> > > KMeans.
>>>> > > >> > > > It would benefit from having implementations of other
>>>> clustering
>>>> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>>>> Hierarchical
>>>> > > >> > > > Clustering, and Affinity Propagation.
>>>> > > >> > > >
>>>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>>>> > > implementation,
>>>> > > >> > > > and I saw an email on this list about interest in implementing
>>>> > > Fuzzy
>>>> > > >> > > > C-Means.
>>>> > > >> > > >
>>>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
>>>> > became
>>>> > > >> > > > apparent that before I implement more clustering algorithms,
>>>> it
>>>> > > would
>>>> > > >> > > > be useful to hammer out a framework to reduce code duplication
>>>> > and
>>>> > > >> > > > implement a consistent API.
>>>> > > >> > > >
>>>> > > >> > > > I'd like to gauge the interest and goals of the MLlib
>>>> community:
>>>> > > >> > > >
>>>> > > >> > > > 1. Are you interested in having more clustering algorithms
>>>> > > available?
>>>> > > >> > > >
>>>> > > >> > > > 2. Is the community interested in specifying a common
>>>> framework?
>>>> > > >> > > >
>>>> > > >> > > > Thanks!
>>>> > > >> > > > RJ
>>>> > > >> > > >
>>>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
>>>> > > >> > > >
>>>> > > >> > > >
>>>> > > >> > > > --
>>>> > > >> > > > em rnowling@gmail.com
>>>> > > >> > > > c 954.496.2314
>>>> > > >> > > >
>>>> > > >> > >
>>>> > > >> > >
>>>> > > >> > >
>>>> > > >> > > --
>>>> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>>>> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>> > > >> > >
>>>> > > >> >
>>>> > > >>
>>>> > > >>
>>>> > > >>
>>>> > > >> --
>>>> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
>>>> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>> > > >>
>>>> > >
>>>> > >
>>>> > >
>>>> > > --
>>>> > > em rnowling@gmail.com
>>>> > > c 954.496.2314
>>>> > >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Yee Yang Li Hector <http://google.com/+HectorYee>
>>>> > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Yee Yang Li Hector <http://google.com/+HectorYee>
>>> *google.com/+HectorYee <http://google.com/+HectorYee>*
>> --
>> em rnowling@gmail.com
>> c 954.496.2314



-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Nick Pentreath <ni...@gmail.com>.
Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great.


I'd say it would make most sense to submit a PR for the Standardised API first, agree on it with everyone, and then build on it for the specific implementations.
—
Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling <rn...@gmail.com> wrote:

> Thanks everyone for the input.
> So it seems what people want is:
> * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
> conquer approach, look at DecisionTree implementation as a reference)
> * Restructure the 3 KMeans clustering algorithm implementations to prevent
> code duplication and conform to a consistent API where possible
> If this is correct, I'll start work on that.  How would it be best to
> structure it? Should I submit separate JIRAs / PRs for refactoring of
> current code, MiniBatch KMeans, and Hierarchical or keep my current
> JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
> for Hierarchical KMeans that builds on top of that?
> Thanks!
> On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <he...@gmail.com> wrote:
>> Yeah if one were to replace the objective function in decision tree with
>> minimizing the variance of the leaf nodes it would be a hierarchical
>> clusterer.
>>
>>
>> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <ev...@gmail.com>
>> wrote:
>>
>>> If you're thinking along these lines, have a look at the DecisionTree
>>> implementation in MLlib. It uses the same idea and is optimized to prevent
>>> multiple passes over the data by computing several splits at each level of
>>> tree building. The tradeoff is increased model state and computation per
>>> pass over the data, but fewer total passes and hopefully lower
>>> communication overheads than, say, shuffling data around that belongs to
>>> one cluster or another. Something like that could work here as well.
>>>
>>> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
>>> efficient way to implement it, though.
>>>
>>>
>>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <he...@gmail.com> wrote:
>>>
>>> > No was thinking more top-down:
>>> >
>>> > assuming a distributed kmeans system already existing, recursively apply
>>> > the kmeans algorithm on data already partitioned by the previous level of
>>> > kmeans.
>>> >
>>> > I haven't been much of a fan of bottom up approaches like HAC mainly
>>> > because they assume there is already a distance metric for items to
>>> items.
>>> > This makes it hard to cluster new content. The distances between sibling
>>> > clusters is also hard to compute (if you have thrown away the similarity
>>> > matrix), do you count paths to same parent node if you are computing
>>> > distances between items in two adjacent nodes for example. It is also a
>>> bit
>>> > harder to distribute the computation for bottom up approaches as you have
>>> > to already find the nearest neighbor to an item to begin the process.
>>> >
>>> >
>>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rn...@gmail.com> wrote:
>>> >
>>> > > The scikit-learn implementation may be of interest:
>>> > >
>>> > >
>>> > >
>>> >
>>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>>> > >
>>> > > It's a bottom up approach.  The pair of clusters for merging are
>>> > > chosen to minimize variance.
>>> > >
>>> > > Their code is under a BSD license so it can be used as a template.
>>> > >
>>> > > Is something like that you were thinking Hector?
>>> > >
>>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dl...@gmail.com>
>>> > > wrote:
>>> > > > sure. more interesting problem here is choosing k at each level.
>>> Kernel
>>> > > > methods seem to be most promising.
>>> > > >
>>> > > >
>>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <he...@gmail.com>
>>> > wrote:
>>> > > >
>>> > > >> No idea, never looked it up. Always just implemented it as doing
>>> > k-means
>>> > > >> again on each cluster.
>>> > > >>
>>> > > >> FWIW standard k-means with euclidean distance has problems too with
>>> > some
>>> > > >> dimensionality reduction methods. Swapping out the distance metric
>>> > with
>>> > > >> negative dot or cosine may help.
>>> > > >>
>>> > > >> Other more useful clustering would be hierarchical SVD. The reason
>>> > why I
>>> > > >> like hierarchical clustering is it makes for faster inference
>>> > especially
>>> > > >> over billions of users.
>>> > > >>
>>> > > >>
>>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>>> >
>>> > > >> wrote:
>>> > > >>
>>> > > >> > Hector, could you share the references for hierarchical K-means?
>>> > > thanks.
>>> > > >> >
>>> > > >> >
>>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <he...@gmail.com>
>>> > > wrote:
>>> > > >> >
>>> > > >> > > I would say for bigdata applications the most useful would be
>>> > > >> > hierarchical
>>> > > >> > > k-means with back tracking and the ability to support k nearest
>>> > > >> > centroids.
>>> > > >> > >
>>> > > >> > >
>>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
>>> >
>>> > > >> wrote:
>>> > > >> > >
>>> > > >> > > > Hi all,
>>> > > >> > > >
>>> > > >> > > > MLlib currently has one clustering algorithm implementation,
>>> > > KMeans.
>>> > > >> > > > It would benefit from having implementations of other
>>> clustering
>>> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>>> Hierarchical
>>> > > >> > > > Clustering, and Affinity Propagation.
>>> > > >> > > >
>>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>>> > > implementation,
>>> > > >> > > > and I saw an email on this list about interest in implementing
>>> > > Fuzzy
>>> > > >> > > > C-Means.
>>> > > >> > > >
>>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
>>> > became
>>> > > >> > > > apparent that before I implement more clustering algorithms,
>>> it
>>> > > would
>>> > > >> > > > be useful to hammer out a framework to reduce code duplication
>>> > and
>>> > > >> > > > implement a consistent API.
>>> > > >> > > >
>>> > > >> > > > I'd like to gauge the interest and goals of the MLlib
>>> community:
>>> > > >> > > >
>>> > > >> > > > 1. Are you interested in having more clustering algorithms
>>> > > available?
>>> > > >> > > >
>>> > > >> > > > 2. Is the community interested in specifying a common
>>> framework?
>>> > > >> > > >
>>> > > >> > > > Thanks!
>>> > > >> > > > RJ
>>> > > >> > > >
>>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
>>> > > >> > > >
>>> > > >> > > >
>>> > > >> > > > --
>>> > > >> > > > em rnowling@gmail.com
>>> > > >> > > > c 954.496.2314
>>> > > >> > > >
>>> > > >> > >
>>> > > >> > >
>>> > > >> > >
>>> > > >> > > --
>>> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>>> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>> > > >> > >
>>> > > >> >
>>> > > >>
>>> > > >>
>>> > > >>
>>> > > >> --
>>> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
>>> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
>>> > > >>
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > em rnowling@gmail.com
>>> > > c 954.496.2314
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > Yee Yang Li Hector <http://google.com/+HectorYee>
>>> > *google.com/+HectorYee <http://google.com/+HectorYee>*
>>> >
>>>
>>
>>
>>
>> --
>> Yee Yang Li Hector <http://google.com/+HectorYee>
>> *google.com/+HectorYee <http://google.com/+HectorYee>*
> -- 
> em rnowling@gmail.com
> c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Thanks everyone for the input.

So it seems what people want is:

* Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
conquer approach, look at DecisionTree implementation as a reference)
* Restructure the 3 KMeans clustering algorithm implementations to prevent
code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that.  How would it be best to
structure it? Should I submit separate JIRAs / PRs for refactoring of
current code, MiniBatch KMeans, and Hierarchical or keep my current
JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
for Hierarchical KMeans that builds on top of that?

Thanks!


On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee <he...@gmail.com> wrote:
> Yeah if one were to replace the objective function in decision tree with
> minimizing the variance of the leaf nodes it would be a hierarchical
> clusterer.
>
>
> On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <ev...@gmail.com>
> wrote:
>
>> If you're thinking along these lines, have a look at the DecisionTree
>> implementation in MLlib. It uses the same idea and is optimized to prevent
>> multiple passes over the data by computing several splits at each level of
>> tree building. The tradeoff is increased model state and computation per
>> pass over the data, but fewer total passes and hopefully lower
>> communication overheads than, say, shuffling data around that belongs to
>> one cluster or another. Something like that could work here as well.
>>
>> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
>> efficient way to implement it, though.
>>
>>
>> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <he...@gmail.com> wrote:
>>
>> > No was thinking more top-down:
>> >
>> > assuming a distributed kmeans system already existing, recursively apply
>> > the kmeans algorithm on data already partitioned by the previous level of
>> > kmeans.
>> >
>> > I haven't been much of a fan of bottom up approaches like HAC mainly
>> > because they assume there is already a distance metric for items to
>> items.
>> > This makes it hard to cluster new content. The distances between sibling
>> > clusters is also hard to compute (if you have thrown away the similarity
>> > matrix), do you count paths to same parent node if you are computing
>> > distances between items in two adjacent nodes for example. It is also a
>> bit
>> > harder to distribute the computation for bottom up approaches as you have
>> > to already find the nearest neighbor to an item to begin the process.
>> >
>> >
>> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rn...@gmail.com> wrote:
>> >
>> > > The scikit-learn implementation may be of interest:
>> > >
>> > >
>> > >
>> >
>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
>> > >
>> > > It's a bottom up approach.  The pair of clusters for merging are
>> > > chosen to minimize variance.
>> > >
>> > > Their code is under a BSD license so it can be used as a template.
>> > >
>> > > Is something like that you were thinking Hector?
>> > >
>> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > > wrote:
>> > > > sure. more interesting problem here is choosing k at each level.
>> Kernel
>> > > > methods seem to be most promising.
>> > > >
>> > > >
>> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <he...@gmail.com>
>> > wrote:
>> > > >
>> > > >> No idea, never looked it up. Always just implemented it as doing
>> > k-means
>> > > >> again on each cluster.
>> > > >>
>> > > >> FWIW standard k-means with euclidean distance has problems too with
>> > some
>> > > >> dimensionality reduction methods. Swapping out the distance metric
>> > with
>> > > >> negative dot or cosine may help.
>> > > >>
>> > > >> Other more useful clustering would be hierarchical SVD. The reason
>> > why I
>> > > >> like hierarchical clustering is it makes for faster inference
>> > especially
>> > > >> over billions of users.
>> > > >>
>> > > >>
>> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
>> >
>> > > >> wrote:
>> > > >>
>> > > >> > Hector, could you share the references for hierarchical K-means?
>> > > thanks.
>> > > >> >
>> > > >> >
>> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <he...@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > > I would say for bigdata applications the most useful would be
>> > > >> > hierarchical
>> > > >> > > k-means with back tracking and the ability to support k nearest
>> > > >> > centroids.
>> > > >> > >
>> > > >> > >
>> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
>> >
>> > > >> wrote:
>> > > >> > >
>> > > >> > > > Hi all,
>> > > >> > > >
>> > > >> > > > MLlib currently has one clustering algorithm implementation,
>> > > KMeans.
>> > > >> > > > It would benefit from having implementations of other
>> clustering
>> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
>> Hierarchical
>> > > >> > > > Clustering, and Affinity Propagation.
>> > > >> > > >
>> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
>> > > implementation,
>> > > >> > > > and I saw an email on this list about interest in implementing
>> > > Fuzzy
>> > > >> > > > C-Means.
>> > > >> > > >
>> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
>> > became
>> > > >> > > > apparent that before I implement more clustering algorithms,
>> it
>> > > would
>> > > >> > > > be useful to hammer out a framework to reduce code duplication
>> > and
>> > > >> > > > implement a consistent API.
>> > > >> > > >
>> > > >> > > > I'd like to gauge the interest and goals of the MLlib
>> community:
>> > > >> > > >
>> > > >> > > > 1. Are you interested in having more clustering algorithms
>> > > available?
>> > > >> > > >
>> > > >> > > > 2. Is the community interested in specifying a common
>> framework?
>> > > >> > > >
>> > > >> > > > Thanks!
>> > > >> > > > RJ
>> > > >> > > >
>> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > --
>> > > >> > > > em rnowling@gmail.com
>> > > >> > > > c 954.496.2314
>> > > >> > > >
>> > > >> > >
>> > > >> > >
>> > > >> > >
>> > > >> > > --
>> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
>> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
>> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
>> > > >>
>> > >
>> > >
>> > >
>> > > --
>> > > em rnowling@gmail.com
>> > > c 954.496.2314
>> > >
>> >
>> >
>> >
>> > --
>> > Yee Yang Li Hector <http://google.com/+HectorYee>
>> > *google.com/+HectorYee <http://google.com/+HectorYee>*
>> >
>>
>
>
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*



-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Hector Yee <he...@gmail.com>.
Yeah, if one were to replace the objective function in a decision tree
with minimizing the variance of the leaf nodes, it would be a
hierarchical clusterer.
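
Concretely, the leaf objective I mean is just within-cluster variance.
A toy, local-only sketch (using breeze; not how a distributed version
would compute it):

    import breeze.linalg.{DenseVector, norm}

    // Impurity of a candidate leaf: sum of squared distances from its
    // points to the leaf mean. Greedily splitting to minimize this is
    // the same objective bisecting k-means optimizes.
    def leafVariance(points: Seq[DenseVector[Double]]): Double = {
      val mean = points.reduce(_ + _) / points.size.toDouble
      points.map(p => math.pow(norm(p - mean), 2.0)).sum
    }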


On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks <ev...@gmail.com>
wrote:

> If you're thinking along these lines, have a look at the DecisionTree
> implementation in MLlib. It uses the same idea and is optimized to prevent
> multiple passes over the data by computing several splits at each level of
> tree building. The tradeoff is increased model state and computation per
> pass over the data, but fewer total passes and hopefully lower
> communication overheads than, say, shuffling data around that belongs to
> one cluster or another. Something like that could work here as well.
>
> I'm not super-familiar with hierarchical K-Means so perhaps there's a more
> efficient way to implement it, though.
>
>
> On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee <he...@gmail.com> wrote:
>
> > No was thinking more top-down:
> >
> > assuming a distributed kmeans system already existing, recursively apply
> > the kmeans algorithm on data already partitioned by the previous level of
> > kmeans.
> >
> > I haven't been much of a fan of bottom up approaches like HAC mainly
> > because they assume there is already a distance metric for items to
> items.
> > This makes it hard to cluster new content. The distances between sibling
> > clusters is also hard to compute (if you have thrown away the similarity
> > matrix), do you count paths to same parent node if you are computing
> > distances between items in two adjacent nodes for example. It is also a
> bit
> > harder to distribute the computation for bottom up approaches as you have
> > to already find the nearest neighbor to an item to begin the process.
> >
> >
> > On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling <rn...@gmail.com> wrote:
> >
> > > The scikit-learn implementation may be of interest:
> > >
> > >
> > >
> >
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> > >
> > > It's a bottom up approach.  The pair of clusters for merging are
> > > chosen to minimize variance.
> > >
> > > Their code is under a BSD license so it can be used as a template.
> > >
> > > Is something like that you were thinking Hector?
> > >
> > > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > > > sure. more interesting problem here is choosing k at each level.
> Kernel
> > > > methods seem to be most promising.
> > > >
> > > >
> > > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <he...@gmail.com>
> > wrote:
> > > >
> > > >> No idea, never looked it up. Always just implemented it as doing
> > k-means
> > > >> again on each cluster.
> > > >>
> > > >> FWIW standard k-means with euclidean distance has problems too with
> > some
> > > >> dimensionality reduction methods. Swapping out the distance metric
> > with
> > > >> negative dot or cosine may help.
> > > >>
> > > >> Other more useful clustering would be hierarchical SVD. The reason
> > why I
> > > >> like hierarchical clustering is it makes for faster inference
> > especially
> > > >> over billions of users.
> > > >>
> > > >>
> > > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com
> >
> > > >> wrote:
> > > >>
> > > >> > Hector, could you share the references for hierarchical K-means?
> > > thanks.
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <he...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > I would say for bigdata applications the most useful would be
> > > >> > hierarchical
> > > >> > > k-means with back tracking and the ability to support k nearest
> > > >> > centroids.
> > > >> > >
> > > >> > >
> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowling@gmail.com
> >
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > MLlib currently has one clustering algorithm implementation,
> > > KMeans.
> > > >> > > > It would benefit from having implementations of other
> clustering
> > > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means,
> Hierarchical
> > > >> > > > Clustering, and Affinity Propagation.
> > > >> > > >
> > > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > > implementation,
> > > >> > > > and I saw an email on this list about interest in implementing
> > > Fuzzy
> > > >> > > > C-Means.
> > > >> > > >
> > > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
> > became
> > > >> > > > apparent that before I implement more clustering algorithms,
> it
> > > would
> > > >> > > > be useful to hammer out a framework to reduce code duplication
> > and
> > > >> > > > implement a consistent API.
> > > >> > > >
> > > >> > > > I'd like to gauge the interest and goals of the MLlib
> community:
> > > >> > > >
> > > >> > > > 1. Are you interested in having more clustering algorithms
> > > available?
> > > >> > > >
> > > >> > > > 2. Is the community interested in specifying a common
> framework?
> > > >> > > >
> > > >> > > > Thanks!
> > > >> > > > RJ
> > > >> > > >
> > > >> > > > [1] - https://github.com/apache/spark/pull/1248
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > em rnowling@gmail.com
> > > >> > > > c 954.496.2314
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Yee Yang Li Hector <http://google.com/+HectorYee>
> > > >> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Yee Yang Li Hector <http://google.com/+HectorYee>
> > > >> *google.com/+HectorYee <http://google.com/+HectorYee>*
> > > >>
> > >
> > >
> > >
> > > --
> > > em rnowling@gmail.com
> > > c 954.496.2314
> > >
> >
> >
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>
> > *google.com/+HectorYee <http://google.com/+HectorYee>*
> >
>



-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by "Evan R. Sparks" <ev...@gmail.com>.
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass over the data, but fewer total passes and hopefully lower
communication overheads than, say, shuffling data around that belongs to
one cluster or another. Something like that could work here as well.

I'm not super-familiar with hierarchical K-Means so perhaps there's a more
efficient way to implement it, though.
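
Very roughly, the level-wise trick could look like this (a sketch under
the assumption that each point carries its current leaf id; all names
here are made up):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // One pass over the data accumulates (count, sum) statistics for
    // the children of *every* current leaf at once, instead of
    // shuffling each cluster's points into a separate RDD.
    def levelStats(
        data: RDD[(Int, Vector)],          // (leafId, point)
        children: Map[Int, KMeansModel])   // 2-means model per leaf
      : Map[(Int, Int), (Long, Array[Double])] =
      data.map { case (leaf, v) =>
        val child = children(leaf).predict(v)
        ((leaf, child), (1L, v.toArray))
      }.reduceByKey { case ((n1, s1), (n2, s2)) =>
        (n1 + n2, s1.zip(s2).map { case (a, b) => a + b })
      }.collect().toMap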



Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Hector Yee <he...@gmail.com>.
No, I was thinking more top-down:

assuming a distributed k-means system already exists, recursively apply
the k-means algorithm to the data already partitioned by the previous
level of k-means.

I haven't been much of a fan of bottom-up approaches like HAC, mainly
because they assume there is already an item-to-item distance metric.
That makes it hard to cluster new content. The distances between sibling
clusters are also hard to compute if you have thrown away the similarity
matrix; for example, do you count paths to the same parent node when
computing distances between items in two adjacent nodes? It is also a bit
harder to distribute the computation for bottom-up approaches, since you
have to find the nearest neighbor of each item just to begin the process.

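A very rough Scala sketch of that top-down scheme, reusing MLlib's
existing KMeans as the per-level building block (the branching factor k,
the depth cutoff, and the iteration count are illustrative parameters,
not any agreed-upon API):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Split the data with k-means, then recurse into each cluster's subset.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Unit = {
      if (depth > 0 && data.count() > k) {
        val model = KMeans.train(data, k, 20)
        for (cluster <- 0 until k) {
          // Points assigned to this cluster by the previous level.
          val subset = data.filter(p => model.predict(p) == cluster).cache()
          hierarchicalKMeans(subset, k, depth - 1)
        }
      }
    }

In practice you would replace the per-cluster filter passes with a single
pass that keys each point by its current node id, as sketched earlier in
the thread.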




-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
The scikit-learn implementation may be of interest:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up approach.  The pair of clusters to merge is chosen to
minimize the increase in within-cluster variance.

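For reference, that merge cost (Ward's criterion) has a closed form for
Euclidean distance: merging clusters A and B increases the total
within-cluster variance by

    \Delta(A, B) = \frac{|A| \, |B|}{|A| + |B|} \, \lVert \mu_A - \mu_B \rVert^2

where \mu_A and \mu_B are the cluster centroids and |A|, |B| the cluster
sizes.
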
Their code is under a BSD license, so it can be used as a template.

Is something like that what you were thinking, Hector?




-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Hector Yee <he...@gmail.com>.
K doesn't matter much; I've tried anything from 2^10 to 10^3, and the
performance doesn't change much as measured by precision @ K (see table 1
of http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although
10^3 k-means did slightly outperform 2^10 hierarchical SVD on those
metrics, the 2^10 SVD was much faster in terms of inference time.

I found that the thing that affected performance most was adding back
tracking to fix mistakes made at higher levels, rather than how K is
picked per level.

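One plausible way to implement that back tracking (purely illustrative; a
sketch, not existing MLlib API) is a best-first search over the centroid
tree: a priority queue keeps a frontier of candidate branches ordered by
distance, so a poor choice near the root can be revisited instead of
being committed to greedily, and the same search naturally yields the k
nearest leaf centroids:

    import scala.collection.mutable

    // A node in the cluster tree; a leaf has no children.
    case class Node(centroid: Array[Double], children: Seq[Node])

    def kNearestLeaves(root: Node, query: Array[Double], k: Int,
                       dist: (Array[Double], Array[Double]) => Double): Seq[Node] = {
      // Reverse the default max-heap ordering so the closest node pops first.
      val frontier = mutable.PriorityQueue.empty[(Double, Node)](
        Ordering.by[(Double, Node), Double](_._1).reverse)
      frontier.enqueue((dist(query, root.centroid), root))
      val found = mutable.ArrayBuffer.empty[Node]
      while (frontier.nonEmpty && found.size < k) {
        val (_, node) = frontier.dequeue()
        if (node.children.isEmpty) found += node
        else node.children.foreach(c => frontier.enqueue((dist(query, c.centroid), c)))
      }
      found.toSeq
    }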





-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sure. The more interesting problem here is choosing k at each level.
Kernel methods seem the most promising.



Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Hector Yee <he...@gmail.com>.
No idea, I never looked it up. I've always just implemented it as running
k-means again on each cluster.

FWIW, standard k-means with Euclidean distance also has problems with
some dimensionality reduction methods. Swapping the distance metric out
for negative dot product or cosine may help.

Another clustering approach that would be more useful is hierarchical
SVD. The reason I like hierarchical clustering is that it makes for
faster inference, especially over billions of users.

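For what it's worth, the swap itself is tiny. A sketch of the two
alternatives as free-standing Scala helpers (MLlib's KMeans is currently
hard-wired to squared Euclidean distance, so these are illustrative, not
hooks it exposes):

    // Smaller values mean "closer" for both distances below.
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => x * y }.sum

    def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

    // Negative dot product as a distance.
    def negDotDistance(a: Array[Double], b: Array[Double]): Double = -dot(a, b)

    // Cosine distance in [0, 2]; assumes non-zero vectors.
    def cosineDistance(a: Array[Double], b: Array[Double]): Double =
      1.0 - dot(a, b) / (norm(a) * norm(b))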




-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Hector, could you share the references for hierarchical K-means? Thanks.



Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Hector Yee <he...@gmail.com>.
I would say that for big data applications the most useful would be
hierarchical k-means with back tracking and the ability to support
k-nearest-centroid queries.





-- 
Yee Yang Li Hector <http://google.com/+HectorYee>
*google.com/+HectorYee <http://google.com/+HectorYee>*

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by RJ Nowling <rn...@gmail.com>.
Hi Yu,

A standardized API has not been implemented yet.  I think it would be
better to implement the other clustering algorithms first and then
extract a common API.  Others may feel differently.  :)

Just a note: there is a pre-existing JIRA for hierarchical KMeans,
SPARK-2429 <https://issues.apache.org/jira/browse/SPARK-2429>, which I
filed.  I added a comment about the previous discussion on the mailing
list, example code provided by Jeremy Freeman, and a couple of papers I
found.

Feel free to take this over; I've played with it but haven't had time to
finish it.  I'd be happy to review the resulting code and discuss
approaches with you.

RJ





-- 
em rnowling@gmail.com
c 954.496.2314

Re: Contributing to MLlib: Proposal for Clustering Algorithms

Posted by Yu Ishikawa <yu...@gmail.com>.
Hi all,

I am also interested in specifying a common framework, and I am trying to
implement hierarchical k-means and a single-link-style hierarchical
clustering method using LSH.
https://issues.apache.org/jira/browse/SPARK-2966

If you have designed the standardized clustering algorithms API, please let
me know.


best,
Yu Ishikawa


