Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/09/30 23:34:42 UTC

[jira] Created: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Implement Affinity Preprocessing for Eigencuts and Spectral KMeans
------------------------------------------------------------------

                 Key: MAHOUT-518
                 URL: https://issues.apache.org/jira/browse/MAHOUT-518
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Jeff Eastman
             Fix For: 0.5


The input format for these clustering algorithms is currently affinity tuples. It would be very nice to have this process automated. Marking for 0.5 as this will require some investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918578#action_12918578 ] 

Shannon Quinn commented on MAHOUT-518:
--------------------------------------

Some of the examples I've seen of generalized spectral clustering use points in two-dimensional space and generate affinities between them. In theory there's no issue with this; the only problem is that you can easily imagine situations where the neighborhoods are non-symmetric (i.e. a point's KNN contains a member whose own KNN does not contain the original point). So yes, the only way to guarantee symmetry is to compute the affinity of each point with every other point, and that clearly isn't scalable. A distance threshold would work much better - something more along the lines of density estimation?
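
[Editor's sketch] The KNN asymmetry described above can be reproduced with three points. This is only an illustration in plain Python, not Mahout code; the helper names `knn` and `within` and the sample points are hypothetical. It also shows why a distance threshold avoids the problem: distance is symmetric, so threshold neighborhoods are too.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def within(points, i, threshold):
    """Indices of all points strictly closer than `threshold` to points[i]."""
    return {j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) < threshold}

# Hypothetical data: an outlier at x=0 and a tight pair at x=10 and x=11.
points = [(0.0, 0.0), (10.0, 0.0), (11.0, 0.0)]

# With k=1 the outlier's nearest neighbour is point 1,
# but point 1's nearest neighbour is point 2, not the outlier.
knn_0 = knn(points, 0, 1)  # {1}
knn_1 = knn(points, 1, 1)  # {2}
asymmetric = 1 in knn_0 and 0 not in knn_1

# Threshold neighbourhoods are symmetric because dist(a, b) == dist(b, a).
threshold = 2.0
symmetric = all(
    (j in within(points, i, threshold)) == (i in within(points, j, threshold))
    for i in range(len(points)) for j in range(len(points)) if i != j
)
print(asymmetric, symmetric)
```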


[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918273#action_12918273 ] 

Jeff Eastman commented on MAHOUT-518:
-------------------------------------

Mean Shift is used in image processing too, but has shown itself to work pretty well on other vector clustering applications. I wonder if spectral clustering can too? I see that the affinity preprocessing of n arbitrary vectors might produce a dense n x n matrix if we used a DistanceMeasure as the affinity measure, and this would clearly not scale. There are also scalability problems with needing to compare each point with every other, as in Mean Shift. But the addition of a distance threshold, similar to T1 & T2 for Canopy, could allow a distance-measure-based preprocessor to produce affinity matrices that were both square and sparse. It might just be GIGO, but it would allow the spectral clustering jobs to consume arbitrary vectors like most of the other clustering jobs.
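
[Editor's sketch] A rough picture of what such a thresholded preprocessor could emit, written in plain Python rather than against any Mahout API. Several things here are assumptions and not from the thread: the function name `affinity_tuples`, Euclidean distance standing in for a DistanceMeasure, the Gaussian kernel used to turn distance into affinity, the `sigma` parameter, and the (row, column, affinity) tuple layout. Note that the loop still compares every pair, so it shows only the sparsity of the output, not a scalable way to compute it.

```python
import math

def affinity_tuples(vectors, threshold, sigma=1.0):
    """Return (i, j, affinity) tuples for pairs no farther apart than `threshold`.

    Pairs beyond the threshold are simply omitted, i.e. treated as zero
    affinity, which keeps the implied n x n matrix sparse. Both (i, j) and
    (j, i) are emitted so the matrix is symmetric.
    """
    tuples = []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            if d <= threshold:
                # Assumed Gaussian kernel: close points get affinity near 1.
                a = math.exp(-d * d / (2.0 * sigma * sigma))
                tuples.append((i, j, a))
                tuples.append((j, i, a))
    return tuples

# Hypothetical data: two well-separated pairs of points.
vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
tuples = affinity_tuples(vectors, threshold=2.0)
for i, j, a in tuples:
    print(i, j, round(a, 4))
```

With this data only 4 of the 16 cells of the implied matrix are populated; raising the threshold trades sparsity for more cross-cluster affinities.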


[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918593#action_12918593 ] 

Jeff Eastman commented on MAHOUT-518:
-------------------------------------

Sort of. The density estimators both use a set of representative points taken from the clustered points output after clustering. But using a threshold to force large distance measures to produce zero affinities - instead of just small affinities - would make the A matrix sparse again and allow subsequent processing to scale better. Even with such a threshold, however, the need to compare each point with every other would make it tricky to do at scale. I can imagine some sort of mapper-side join in a custom InputFormat would be required, as in DRM.


[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917263#action_12917263 ] 

Shannon Quinn commented on MAHOUT-518:
--------------------------------------

This is worth discussion: Eigencuts (and spectral clustering algorithms as a whole) is designed to work specifically on images, but in theory can be used for any general-purpose clustering if the data format is correct. The primary issue here, however, is that the input affinity matrix A currently must be symmetric (and sparse, but that's the Mahout requirement). The symmetry is easy to achieve with images: the general rule of thumb is that for each pixel, the neighborhood of affinities consists of the 8 pixels around it, which makes the matrix both sparse and symmetric. Were these data points to be drawn from arbitrary distributions (say, a bunch of points in Euclidean space), you can picture instances where the neighborhoods of nearest data points aren't symmetric.
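
[Editor's sketch] The 8-pixel neighborhood rule can be illustrated in a few lines of plain Python. The function name `pixel_affinities`, the intensity-difference Gaussian used as the affinity, and the `sigma` value are assumptions made for the example; the thread does not say which affinity function Eigencuts uses. Pixels are numbered row by row.

```python
import math

def pixel_affinities(image, sigma=10.0):
    """Map (pixel, neighbour) index pairs to affinities for a grayscale image.

    Each pixel is connected only to the (up to) 8 pixels around it, so the
    implied matrix is sparse. The affinity depends only on the squared
    intensity difference, so A[p, q] == A[q, p].
    """
    rows, cols = len(image), len(image[0])
    affinities = {}
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) == (0, 0):
                        continue  # skip the pixel itself
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # neighbour falls outside the image
                    diff = image[r][c] - image[rr][cc]
                    affinities[(r * cols + c, rr * cols + cc)] = math.exp(
                        -diff * diff / (2.0 * sigma * sigma))
    return affinities

# Hypothetical 3x3 image: a dark region next to a bright column.
image = [[0, 0, 255],
         [0, 0, 255],
         [0, 0, 255]]
A = pixel_affinities(image)
print(len(A), all(A[(p, q)] == A[(q, p)] for (p, q) in A))
```

Pixels of equal intensity get affinity 1.0, while pairs that straddle the dark/bright edge get an affinity that is vanishingly small but not exactly zero.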

There are optimizations that can be made to convert non-symmetric input data into lower-rank approximations that are fully symmetric, but that's probably something we should tackle later (there's a section on this problem specifically in Dr. Chennubhotla's thesis containing Eigencuts; that's probably a good place to start). My recommendation for this ticket is to allow the algorithm to process raw images; generalized input data (which is not necessarily symmetric) can come later.

My only point regarding images is that for most academic purposes, this algorithm has used input images in PGM format; the problem is that Java doesn't have a native PGM image processor, which is why I'm still tweaking the Eigencuts examples. If anyone knows of something that would help with this, please let me know.


[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917268#action_12917268 ] 

Ted Dunning commented on MAHOUT-518:
------------------------------------

Java Advanced Imaging can read PNM formatted images, which includes PGM (and PBM and PPM) formatted images.

See 
http://java.sun.com/products/java-media/jai/iio.html
and
http://java.sun.com/javase/technologies/desktop/media/jai/project/index.html
and
https://jai-imageio.dev.java.net/

Sanselan also supports reading images in PGM format.  

I don't know how much we want to add a dependency just for a demo program for eigenCuts.
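
[Editor's sketch] For a demo, the plain-text PGM variant (magic number `P2`) is simple enough to read without an imaging library. The sketch below is in Python for brevity and is not part of Mahout, JAI, or Sanselan; the function name `parse_ascii_pgm` is hypothetical. It assumes the ASCII variant only and does not handle binary `P5` files.

```python
def parse_ascii_pgm(text):
    """Parse an ASCII (P2) PGM image into (rows, maxval).

    `rows` is a list of lists of integer gray values. Comments run from
    '#' to the end of the line and are ignored.
    """
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII (P2) PGM file")
    width, height, maxval = (int(t) for t in tokens[1:4])
    values = [int(t) for t in tokens[4:]]
    if len(values) != width * height:
        raise ValueError("pixel count does not match the declared size")
    if any(v < 0 or v > maxval for v in values):
        raise ValueError("gray value outside 0..maxval")
    rows = [values[r * width:(r + 1) * width] for r in range(height)]
    return rows, maxval

# Hypothetical 3x2 image with an embedded comment.
sample = """P2
# tiny test image
3 2
255
0 128 255
255 128 0
"""
rows, maxval = parse_ascii_pgm(sample)
print(rows, maxval)
```

The same token-based approach ports directly to Java with a `Scanner`, which may be one way to avoid adding a dependency for the example program.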



[jira] Commented: (MAHOUT-518) Implement Affinity Preprocessing for Eigencuts and Spectral KMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917861#action_12917861 ] 

Shannon Quinn commented on MAHOUT-518:
--------------------------------------

That's a good point, and these images can always be converted to a friendly format for the example. However, if we want to support raw input we may still have to incorporate raw image preprocessing into Eigencuts, which will likely require JAI. Unless there's an intermediate format we could support instead? I'm totally open to ideas on this one.
