You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2010/12/12 04:36:01 UTC

[jira] Created: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

CanopyEstimator - Estimate T1/T2 for CanopyClusterer
----------------------------------------------------

                 Key: MAHOUT-563
                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Lance Norskog
            Priority: Minor


Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991977#comment-12991977 ] 

Sean Owen commented on MAHOUT-563:
----------------------------------

Couple other things... careful about Double.MIN_VALUE. It's not the most negative value -- it's actually the smallest positive double. Double.NEGATIVE_INFINITY is the right one.

t1Found/t2Found are set in setStarterValues(), but not used in estimate?

Also it's possible that subSample() never terminates if the input had any nulls.

We might want to back up and discuss this a little more...

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992244#comment-12992244 ] 

Sean Owen commented on MAHOUT-563:
----------------------------------

It's superseded by another commit? or it should go somewhere but utils/?
I assume the algorithms is fine enough to commit as a start. Anyone know of a better way, or is this already done?

I think the open question is more what behavior you want. Should those T1/T2 values be output, but not the canopies?

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992942#comment-12992942 ] 

Lance Norskog commented on MAHOUT-563:
--------------------------------------

The estimator samples 20 points and tries to guess T1/T2 from that subset. Yes, "all distances" is not a scalable technique.
bq. Uhh... this isn't a graph.
Yup, brain-hiccup. 





> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992946#comment-12992946 ] 

Sean Owen commented on MAHOUT-563:
----------------------------------

OK, well perhaps post an updated patch as and when you like, incorporating the feedback in this thread. Or perhaps Ted's right and its applicability is somewhat limited.

Regarding my comment, "Should those T1/T2 values be output, but not the canopies?" -- my point is that the object which is to collect the canopies is never filled out. Nothing puts anything into it. Maybe that's just a typo; may want to look at a number of small things like that.

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991976#comment-12991976 ] 

Sean Owen commented on MAHOUT-563:
----------------------------------

I think this could be fine for utils/. There are some style issues with the patch (System.out, no copyright, some unused variables, spacing, braces) but easily fixed locally. I might sprinkle some javadoc on it too.

I have a few questions:

The variable canopies is never filled out. Should it be removed? Or is this supposed to be filled out in estimate()?
There are some constant-looking local variables in estimate(). Can I make them constants?



> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Assigned: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned MAHOUT-563:
--------------------------------

    Assignee: Sean Owen

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992776#comment-12992776 ] 

Lance Norskog commented on MAHOUT-563:
--------------------------------------

bq. I assume the algorithms is fine enough to commit as a start. Anyone know of a better way, or is this already done?
The algorithm is a placeholder. Studying the problem again, I think the algorithm should minimize overlaps.
Here's a possibility: use Floyd's Algorithm to find all of the distances between points. Then, make a histogram of the distances and choose T1 and T2 from different percentiles.
[Floyd-Warshall-Roy Algorithm|http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm]
bq. or it should go somewhere but utils/?
It is not intended as a separate utility program. You would use it in code directly. The 'make a canopy' job would use this by default, and then add an option to let you specify T1/T2.
bq. Should those T1/T2 values be output, but not the canopies?
This is the point of having an object that stashes them all. The algorithm explicitly subsamples the data merely to get the distances. It is possible that the generated canopies will be good enough for many applications, so some way to pull them out is needed. Also it should store the sampled vectors- those can be reused for other purposes.


> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Issue Comment Edited: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992942#comment-12992942 ] 

Lance Norskog edited comment on MAHOUT-563 at 2/10/11 9:03 AM:
---------------------------------------------------------------

Clearly one estimator does not fit all datasets. Is it worth pursuing different estimators?

bq. Uhh... this isn't a graph.
Yup, brain-hiccup. 





      was (Author: lancenorskog):
    The estimator samples 20 points and tries to guess T1/T2 from that subset. Yes, "all distances" is not a scalable technique.
bq. Uhh... this isn't a graph.
Yup, brain-hiccup. 




  
> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-563.
------------------------------

    Resolution: Won't Fix
      Assignee:     (was: Sean Owen)

If I'm not wrong, this didn't quite reach a conclusion. We can reopen if there is a new and concrete patch available which addresses the issue, and previous comments from committers.

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated MAHOUT-563:
---------------------------------

    Attachment: MAHOUT-563.patch

First cut. 

CanopyEstimator does three things: 
# Makes a random subsample of a given vector set
# Estimated initial T1/T2 numbers from the maximum and minimum distances between vectors in the subsample
# Iteratively hunts for T1/T2 values will create a list of canopies within a given minimum and maximum size

The algorithm for the latter is lame, but it does something useful in small and medium-scale tests.

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992794#comment-12992794 ] 

Ted Dunning commented on MAHOUT-563:
------------------------------------

{quote}
Here's a possibility: use Floyd's Algorithm to find all of the distances between points. Then, make a histogram of the distances and choose T1 and T2 from different percentiles.
Floyd-Warshall-Roy Algorithm
{quote}

Uhh... this isn't a graph.

And we try not to do anything that starts "all distances between all points" because that way lies madness in terms of scalability.

Picking a thousand pairs of points and using those for your histogram might work.  At least you would get a reasonable approximation of the actual distance histogram.

My guess is that the zone of applicability for a simple heuristic like this will be kind of narrow.  In my experience I have seen data that looks like stuff on a hyper sphere.  This has all distances clustered around a single value but can still often be clustered well.  I have also seen data that has power law membership of clusters and the clusters have very different sizes.  In that case, the histogram is likely to show only the large clusters.

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (MAHOUT-563) CanopyEstimator - Estimate T1/T2 for CanopyClusterer

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992239#comment-12992239 ] 

Lance Norskog commented on MAHOUT-563:
--------------------------------------

Yes, this was intended as a skeleton for a better algorithm. Someone recently posted some better code; this is where it would go.

Aside from the dummy algorithm, does the whole thing make sense? If it works, would we add it to the 'canopy' jobs?

> CanopyEstimator - Estimate T1/T2 for CanopyClusterer
> ----------------------------------------------------
>
>                 Key: MAHOUT-563
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-563
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-563.patch
>
>
> Hunting for T1/T2 values that make an interesting Canopy set is a singularly unsatisfying task. This class estimates T1 and T2 numbers given the original set.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira