Posted to dev@mahout.apache.org by "Shannon Quinn (JIRA)" <ji...@apache.org> on 2010/04/05 06:21:27 UTC

[jira] Created: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
------------------------------------------------------------------

                 Key: MAHOUT-363
                 URL: https://issues.apache.org/jira/browse/MAHOUT-363
             Project: Mahout
          Issue Type: Task
            Reporter: Shannon Quinn


Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue MAHOUT-328)

Student Name: Shannon Quinn

Student E-mail: magsol@gmail.com

Organization/Project:
Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes is not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of a piecewise-constant approximation of the eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data is not known a priori. K-means is a classic technique which, given some K, assigns each data point to a cluster as a function of its distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
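
To make the baseline concrete, a single (sequential) k-means iteration looks roughly like the sketch below. This is only an illustration in plain Java with names of my own choosing, not the planned Mahout implementation:

    public class KMeansStep {
      /** One assignment-plus-update pass over the data using squared Euclidean distance. */
      static double[][] iterate(double[][] points, double[][] means) {
        int k = means.length, dim = means[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < k; c++) {
            double d = 0;
            for (int j = 0; j < dim; j++) {
              double diff = p[j] - means[c][j];
              d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
          }
          counts[best]++;
          for (int j = 0; j < dim; j++) sums[best][j] += p[j];
        }
        double[][] newMeans = new double[k][];
        for (int c = 0; c < k; c++) {
          if (counts[c] == 0) {
            newMeans[c] = means[c].clone();   // empty cluster: keep the old mean
          } else {
            newMeans[c] = new double[dim];
            for (int j = 0; j < dim; j++) newMeans[c][j] = sums[c][j] / counts[c];
          }
        }
        return newMeans;                      // caller repeats until the means stop moving
      }
    }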

Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.
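
For illustration, constructing such a matrix M from pairwise affinities (here assuming a Gaussian kernel purely as an example similarity measure) could look like this sketch:

    public class MarkovMatrix {
      /** Build a row-stochastic transition matrix from pairwise Gaussian affinities. */
      static double[][] fromPoints(double[][] points, double sigma) {
        int n = points.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
          double rowSum = 0;
          for (int j = 0; j < n; j++) {
            double d2 = 0;
            for (int k = 0; k < points[i].length; k++) {
              double diff = points[i][k] - points[j][k];
              d2 += diff * diff;
            }
            m[i][j] = Math.exp(-d2 / (2 * sigma * sigma)); // affinity (edge weight) w_ij
            rowSum += m[i][j];
          }
          for (int j = 0; j < n; j++) m[i][j] /= rowSum;   // normalize row: transition probabilities
        }
        return m;
      }
    }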

However, these techniques all suffer from drawbacks, the two most significant of which are having to choose an arbitrary K a priori, and the breakdown of the piecewise-constant approximation on the eigenvectors when the clusters are tightly coupled.

The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm, it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, or flows of probability along each eigenvector, have an associated half-life of flow decay. By perturbing the weights between nodes, one can observe where bottlenecks exist from the sensitivity of each eigenflow's half-life, allowing the identification of boundaries between clusters. Thus, this algorithm iterates until no more cuts between clusters need to be made, eliminating the need for an a priori K and conferring the ability to separate tightly coupled clusters.
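
To give a flavour of the quantity involved: for a mode of the Markov matrix with eigenvalue lambda in (0, 1), the flow along that eigenvector decays like lambda^t, so its half-life follows directly from lambda. The sketch below shows that relation plus a crude finite-difference notion of sensitivity; this is my own simplification for illustration, not the exact formulation from the paper:

    public class EigenflowHalfLife {
      /** Number of Markov steps for a mode with eigenvalue lambda (0 < lambda < 1) to lose half its flow. */
      static double halfLife(double lambda) {
        // lambda^t = 1/2  =>  t = ln(1/2) / ln(lambda)
        return Math.log(0.5) / Math.log(lambda);
      }

      /**
       * Finite-difference sensitivity of the half-life to an edge-weight perturbation of size epsilon.
       * lambdaPerturbed is the eigenvalue recomputed after the perturbation; a large magnitude here
       * flags the perturbed edge as a bottleneck between clusters and a candidate for cutting.
       */
      static double sensitivity(double lambda, double lambdaPerturbed, double epsilon) {
        return (halfLife(lambdaPerturbed) - halfLife(lambda)) / epsilon;
      }
    }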

The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations in graph and matrix data, will be crucial in parallelizing the computations of transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.
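
As a sketch of how one building block might map onto Hadoop, the inner matrix-by-vector step of an iterative eigensolver could be expressed as a mapper like the one below. The storage format (one whitespace-separated dense row per record, keyed by row index) and the configuration key are assumptions of mine, not the final design:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Emits y_i = sum_j M_ij * x_j for each row i of the transition matrix M. */
    public class RowTimesVectorMapper
        extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

      private double[] x; // current eigenvector estimate

      @Override
      protected void setup(Context ctx) {
        // Assumed convention: the current vector is passed as a comma-separated string.
        String[] parts = ctx.getConfiguration().get("eigencuts.vector").split(",");
        x = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          x[i] = Double.parseDouble(parts[i]);
        }
      }

      @Override
      protected void map(LongWritable rowIndex, Text rowText, Context ctx)
          throws IOException, InterruptedException {
        String[] entries = rowText.toString().trim().split("\\s+");
        double dot = 0.0;
        for (int j = 0; j < entries.length; j++) {
          dot += Double.parseDouble(entries[j]) * x[j];
        }
        ctx.write(rowIndex, new DoubleWritable(dot));
      }
    }

A reducer (or the driver) would renormalize the resulting vector and the job would be rerun until convergence; in practice an existing distributed eigensolver would replace this hand-rolled step.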

Since Dr. Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. He has thus far been instrumental in highlighting the finer points of the underlying theory; his guidance, coupled with my experience in and knowledge of software engineering, makes this a project we are both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and will be accompanied by unit tests with full coverage.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop, Mahout, Hama; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral clustering. Integrate map/reduce framework via Hama for calculation of transition matrices and associated eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Hama. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr. Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and using CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves the use of matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc.).

I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM Extreme Blue, I feel I am extremely adept at picking up new frameworks quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.



[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856229#action_12856229 ] 

Shannon Quinn commented on MAHOUT-363:
--------------------------------------

I've been getting a lot of good feedback, and after discussing my proposal at length with a few people, I'll be submitting changes to the proposal timeline, particularly regarding the k-means implementation.

Essentially, I'll cut out what is currently sprint 1, shifting sprints 2-4 back two weeks. I'll start immediately on implementing the EigenCuts algorithm, relying heavily on test-driven development and full-coverage unit testing to highlight any bugs in the EigenCuts implementation. Using the extra two weeks, I'll focus on providing more thorough documentation, including detailed specs on how to use the algorithm. Several PhD students and professors at CMU with whom I am working are interested in a map/reduce implementation of this algorithm, and they have massive datasets readily available for use. I'll post an example dataset, along with instructions on how to feed it to Mahout and what the output should look like.

If there is extra time at the end, I can work on the user interface and data visualization.

The schedule aside, there are already a few points of the implementation that I can address in slightly greater detail. Regarding the decomposition of the similarity matrix and computation of eigenvectors, this is the computationally difficult part and can be done in several different ways. A stochastic decomposition, as proposed in http://arxiv.org/abs/0909.4061v1 (via Ted Dunning), would be one of several approaches, though from a performance perspective it looks to be one of the best. For the mechanics of parallelizing EigenCuts, the ECML/PKDD 2008 paper "Parallel spectral clustering" (via Isabel Drost) provides an excellent starting point, in addition to using Mahout's existing map/reduce eigensolver.
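
For concreteness, the heart of that stochastic approach is a randomized range finder: multiply the matrix by a handful of Gaussian random vectors, orthonormalize the result, and then solve a much smaller eigenproblem in that basis. A plain-Java, sequential sketch of my reading of the idea (not an implementation plan) follows:

    import java.util.Random;

    public class RandomizedRangeFinder {
      /** Returns an n x k orthonormal basis Q approximately spanning the range of A. */
      static double[][] randomBasis(double[][] a, int k, long seed) {
        int n = a.length, m = a[0].length;
        Random rnd = new Random(seed);

        // Y = A * Omega, where Omega is an m x k Gaussian random matrix.
        double[][] y = new double[n][k];
        for (int col = 0; col < k; col++) {
          double[] omega = new double[m];
          for (int j = 0; j < m; j++) omega[j] = rnd.nextGaussian();
          for (int i = 0; i < n; i++) {
            double s = 0;
            for (int j = 0; j < m; j++) s += a[i][j] * omega[j];
            y[i][col] = s;
          }
        }

        // Orthonormalize the columns of Y via Gram-Schmidt to obtain Q.
        for (int col = 0; col < k; col++) {
          for (int prev = 0; prev < col; prev++) {
            double dot = 0;
            for (int i = 0; i < n; i++) dot += y[i][col] * y[i][prev];
            for (int i = 0; i < n; i++) y[i][col] -= dot * y[i][prev];
          }
          double norm = 0;
          for (int i = 0; i < n; i++) norm += y[i][col] * y[i][col];
          norm = Math.sqrt(norm);
          for (int i = 0; i < n; i++) y[i][col] /= norm;
        }
        return y; // for symmetric A, eigenvectors are recovered from the small k x k matrix Q^T A Q
      }
    }

In the map/reduce setting it is the multiplications by A that would be distributed, while the small k-by-k problem fits comfortably in memory.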

I hope this helps clarify some points and strengthen the overall feasibility of the proposal.


[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855061#action_12855061 ] 

Shannon Quinn commented on MAHOUT-363:
--------------------------------------

I apologize for that. I actually only changed some of the wording; the overall proposal structure wasn't changed. But I will certainly refrain from editing the ticket itself.

Are there any other suggestions for making the proposal more viable?



[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854950#action_12854950 ] 

Jake Mannix commented on MAHOUT-363:
------------------------------------

If possible, Shannon, please simply add comments to this JIRA ticket instead of editing the original ticket itself; that way we'll be able to more easily follow your thinking. Otherwise, we can't really see what has changed.



[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853303#action_12853303 ] 

Ted Dunning commented on MAHOUT-363:
------------------------------------


Nice proposal.  Well written and well conceived.

One tiny suggestion is to omit Hama from your plan, as it would just be a distraction for you.  The Hama project is pretty much independent of Mahout and there hasn't been any contribution in the H->M direction.





[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853415#action_12853415 ] 

Shannon Quinn commented on MAHOUT-363:
--------------------------------------

Hi Robin, thanks for the suggestions! I have started playing with the code (though I should read the wiki more, excellent point). My rationale behind the k-means implementation was to build an initial framework for the EigenCuts algorithm, around which I could build a UI that accepts input and displays output, and to tap into the map/reduce framework. Both would work alongside a very simple clustering technique that would be easy to implement. Essentially, the goal of the first sprint is to get the UI and map/reduce integration running; the k-means algorithm is merely a test condition for those two features, given its ease of implementation.

That's just my explanation; if you feel otherwise I'm happy to adjust my proposal :)



[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853409#action_12853409 ] 

Robin Anil commented on MAHOUT-363:
-----------------------------------

Hi Shannon, did you take time to explore the Mahout code? I believe the k-means you are looking to implement is already there; that will shave 2 weeks off your GSoC :). Reading the code/wiki is a great exercise to help you be more realistic in your proposal.



[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853308#action_12853308 ] 

Jake Mannix commented on MAHOUT-363:
------------------------------------

... and actually, there is no need for Hama, as distributed matrix computations can already be done in Mahout natively (in particular: finding eigenvectors and eigenvalues is already done as a Map/Reduce job).

This proposal seems to be at exactly the right level of additional work and utilization of our current set of primitive operations for a GSoC project.  I wish I had the time to help with mentoring this project, in fact.



[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-363:
---------------------------------

    Description: 
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn

Student E-mail: magsol@gmail.com

Organization/Project:

Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of piecewise constant approximation of eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data are not known a priori. K-means is a classic technique which, given some K, attempts to label data points within a cluster as a function of their distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.

Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most significant being the need to choose an arbitrary K a priori, and the breakdown of the piecewise constant approximation on the eigenvectors when the clusters are tightly coupled.

The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm, it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, the flows of probability associated with each eigenvector, decay over time, and each has a characteristic half-life. By perturbing the weights between nodes, one can observe where these half-lives are most sensitive, which exposes bottleneck edges and hence the boundaries between clusters. The algorithm cuts those edges and iterates until no more cuts need to be made, eliminating the need for an a priori K and conferring the ability to separate tightly coupled clusters.
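To make the two quantities above concrete, here is a minimal sketch (plain Java, assuming a small dense symmetric affinity matrix W; the class and method names are purely illustrative, and the perturbation analysis that actually drives EigenCuts is deliberately omitted):

{code}
/**
 * Sketch only: builds the row-stochastic Markov matrix M = D^-1 W from a dense
 * symmetric affinity matrix W, and converts an eigenvalue of M into the
 * half-life of its eigenflow. The sensitivity of that half-life to individual
 * edge weights -- the heart of EigenCuts -- is not shown here.
 */
public class MarkovSketch {

  /** M[i][j] = W[i][j] / degree(i), so every row of M sums to 1. */
  static double[][] transitionMatrix(double[][] affinity) {
    int n = affinity.length;
    double[][] m = new double[n][n];
    for (int i = 0; i < n; i++) {
      double degree = 0.0;
      for (int j = 0; j < n; j++) {
        degree += affinity[i][j];
      }
      for (int j = 0; j < n; j++) {
        m[i][j] = affinity[i][j] / degree;       // assumes degree > 0
      }
    }
    return m;
  }

  /** Half-life of an eigenflow: the t at which lambda^t = 1/2, for 0 < lambda < 1. */
  static double halfLife(double lambda) {
    return Math.log(0.5) / Math.log(lambda);
  }

  public static void main(String[] args) {
    double[][] w = {
        {0.0, 1.0, 0.1},
        {1.0, 0.0, 0.1},
        {0.1, 0.1, 0.0}
    };
    System.out.println("M[0] = " + java.util.Arrays.toString(transitionMatrix(w)[0]));
    System.out.println("half-life at lambda = 0.9: " + halfLife(0.9));
  }
}
{code}

Neither step above is the bottleneck; the cost comes from recomputing the eigendecomposition after every round of cuts, which is the part the next paragraph proposes to parallelize.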

The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Mahout's existing distributed matrix and eigensolver infrastructure (rather than Apache Hama, as originally proposed) will be used to parallelize the computation of the transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.

Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. His guidance in highlighting the finer points of the underlying theory, coupled with my experience in and knowledge of software engineering, makes this a project we are both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and will be accompanied by unit tests with full coverage.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop and Mahout; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Extend the k-means implementation to spectral clustering. Integrate with the map/reduce framework via Mahout, taking advantage of its existing distributed computation of eigenvectors and eigenvalues for the transition matrices.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and using CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves the use of matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc.).

I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.

  was:
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn

Student E-mail: magsol@gmail.com

Organization/Project:Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of piecewise constant approximation of eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data are not known a priori. K-means is a classic technique which, given some K, attempts to label data points within a cluster as a function of their distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.

Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most significant of which are having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters where the piecewise constant approximation on the eigenvectors no longer holds.

The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, or flow of probability by eigenvector, have an associated half life of flow decay called eigenflow. By perturbing the weights between nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, allowing for the identification of boundaries between clusters. Thus, this algorithm iterates until no more cuts between clusters need to be made, eliminating the need for an a prior K, and conferring the ability to separate tightly coupled clusters.

The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations in graph and matrix data, will be crucial in parallelizing the computations of transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.

Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. He has thus far been instrumental in highlighting the finer points of the underlying theory, and coupled with my experience in and knowledge of software engineering, this is a project we are both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and have full coverage accompanying unit tests.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop and Mahout; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral clustering. Integrate map/reduce framework via Mahout and take advantage of existing core calculation of transition matrices and associated eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves use of matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc). 

I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.




[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855068#action_12855068 ] 

Ted Dunning commented on MAHOUT-363:
------------------------------------

{quote}
Are there any other suggestions for making the proposal more viable?
{quote}
I think that the proposal as it stands is very strong.

When and if it comes down to doing the work, I will be asking questions like whether it is possible to take several eigenvectors at a time instead of just one, and how this work relates to the stochastic decomposition work described at http://arxiv.org/abs/0909.4061v1.  In particular, I wonder if you could re-use the randomized product in doing subsequent decompositions.

But these kinds of questions aren't necessarily the kind of thing that you need to answer in the proposal ... they are just how you get side-tracked after you start.  :-)
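For readers who haven't seen that paper, here is a rough, self-contained sketch of its randomized range-finder step (plain Java, dense and single-machine, with invented names; an illustration of the idea, not of any existing Mahout code). A handful of Gaussian probe vectors, pushed through A once and orthonormalized, approximately span the space of the leading eigenvectors, which is what makes "several eigenvectors at a time" cheap.

{code}
import java.util.Arrays;
import java.util.Random;

/**
 * Illustrative sketch of the randomized range finder from Halko, Martinsson and
 * Tropp (arXiv:0909.4061): form Y = A * Omega for a thin Gaussian Omega, then
 * orthonormalize the columns of Y. The columns of the result Q approximately
 * span the top-k invariant subspace of A; the expensive decomposition is then
 * done on the small k-by-k matrix Q^T * A * Q. No error control, no power
 * iterations -- a sketch, not an implementation.
 */
public class RandomizedRangeFinderSketch {

  static double[][] rangeBasis(double[][] a, int k, long seed) {
    int n = a.length;
    Random random = new Random(seed);
    double[][] y = new double[n][k];
    // Y = A * Omega, one Gaussian probe vector per requested basis column.
    for (int col = 0; col < k; col++) {
      double[] omega = new double[n];
      for (int i = 0; i < n; i++) {
        omega[i] = random.nextGaussian();
      }
      for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
          sum += a[i][j] * omega[j];
        }
        y[i][col] = sum;
      }
    }
    // Gram-Schmidt: orthonormalize the columns of Y in place, yielding Q.
    for (int col = 0; col < k; col++) {
      for (int prev = 0; prev < col; prev++) {
        double dot = 0.0;
        for (int i = 0; i < n; i++) {
          dot += y[i][col] * y[i][prev];
        }
        for (int i = 0; i < n; i++) {
          y[i][col] -= dot * y[i][prev];
        }
      }
      double norm = 0.0;
      for (int i = 0; i < n; i++) {
        norm += y[i][col] * y[i][col];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) {
        y[i][col] /= norm;
      }
    }
    return y;
  }

  public static void main(String[] args) {
    double[][] a = {
        {4.0, 1.0, 0.0},
        {1.0, 3.0, 0.5},
        {0.0, 0.5, 1.0}
    };
    double[][] q = rangeBasis(a, 2, 42L);
    System.out.println(Arrays.toString(q[0]) + " / " + Arrays.toString(q[1]));
  }
}
{code}

Re-using the randomized product across EigenCuts iterations would then amount to asking whether the basis Q from one iteration is still a good starting point after only a few edge weights have been perturbed or cut.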





[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-363:
---------------------------------

    Description: 
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn

Student E-mail: magsol@gmail.com

Organization/Project:Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of piecewise constant approximation of eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data are not known a priori. K-means is a classic technique which, given some K, attempts to label data points within a cluster as a function of their distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.

Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most significant of which are having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters where the piecewise constant approximation on the eigenvectors no longer holds.

The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, or flow of probability by eigenvector, have an associated half life of flow decay called eigenflow. By perturbing the weights between nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, allowing for the identification of boundaries between clusters. Thus, this algorithm iterates until no more cuts between clusters need to be made, eliminating the need for an a prior K, and conferring the ability to separate tightly coupled clusters.

The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations in graph and matrix data, will be crucial in parallelizing the computations of transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.

Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. He has thus far been instrumental in highlighting the finer points of the underlying theory, and coupled with my experience in and knowledge of software engineering, this is a project we are both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and have full coverage accompanying unit tests.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop and Mahout; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral clustering. Integrate map/reduce framework via Mahout and take advantage of existing core calculation of transition matrices and associated eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves use of matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc). 

I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.

  was:
Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn

Student E-mail: magsol@gmail.com

Organization/Project:Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes are not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of piecewise constant approximation of eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] solves both these problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data are not known a priori. K-means is a classic technique which, given some K, attempts to label data points within a cluster as a function of their distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.

Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most significant of which are having to choose an arbitrary K a priori, and in the situation of tightly coupled clusters where the piecewise constant approximation on the eigenvectors no longer holds.

The EigenCuts algorithm addresses both these issues. As a type of spectral clustering algorithm it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, or flow of probability by eigenvector, have an associated half life of flow decay called eigenflow. By perturbing the weights between nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, allowing for the identification of boundaries between clusters. Thus, this algorithm iterates until no more cuts between clusters need to be made, eliminating the need for an a prior K, and conferring the ability to separate tightly coupled clusters.

The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations in graph and matrix data, will be crucial in parallelizing the computations of transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.

Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. He has thus far been instrumental in highlighting the finer points of the underlying theory, and coupled with my experience in and knowledge of software engineering, this is a project we are both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and have full coverage accompanying unit tests.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop, Mahout, Hama; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral clustering. Integrate map/reduce framework via Hama for calculating of transition matrices and associated eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Hama. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and use CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves use of matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc). 

I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.




[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853306#action_12853306 ] 

Shannon Quinn commented on MAHOUT-363:
--------------------------------------

Thank you for the feedback!

Cutting out Hama would certainly improve the feasibility of the project timeline and allow me to further refine the overall algorithm. I will absolutely follow your advice; I'll edit this ticket and my GSoC application. Thank you again!
