Posted to user@mahout.apache.org by Pat Ferrel <pa...@farfetchers.com> on 2012/05/10 02:36:15 UTC

Canopy estimator

Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563

Did anything ever get done with this? Ted mentions limited usefulness. 
This may be true, but the cases he mentions as counterexamples are also 
not very good for using canopy ahead of k-means, no? That info would be a 
useful result. To use canopies I find myself running the job over and over, 
trying to see some inflection in the number of clusters. Why not 
automate this? Even if the data shows nothing, that is itself an answer 
of value, and it would save a lot of hand work to find out the same thing.
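Such a sweep is easy to sketch. Below is a toy, in-memory illustration (plain Python with a single distance threshold t, not Mahout's MapReduce canopy job; the synthetic blob data is my own assumption) that counts canopies across a range of thresholds, which is where you'd look for an inflection or plateau:

```python
import numpy as np

def canopy_count(points, t):
    """Greedy canopy count: pick the first remaining point as a center,
    remove every point within distance t of it, repeat."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(0)
        remaining = [p for p in remaining if np.linalg.norm(p - center) > t]
        count += 1
    return count

rng = np.random.default_rng(42)
# two well-separated synthetic blobs
data = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                  rng.normal(8.0, 0.5, (50, 2))])
for t in [0.5, 1.0, 2.0, 5.0]:
    print(t, canopy_count(data, t))
```

On structured data like this the count settles at the natural number of clusters as t grows (here, 2); on structureless data it just decays smoothly, which is itself an answer of value.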

Re: Canopy estimator

Posted by Ted Dunning <te...@gmail.com>.
Roughly.

But it also gives you a small-ish surrogate for your data that would let
you use all kinds of different clustering methods since the surrogate fits
in memory.

On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> This is why canopy has been frustrating because by varying t I would have
> hoped to generate these levels of specificity, then replace hierarchical
> clustering with a similarity measure. In other words L1 has 1000 docs per
> cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest to L1
> clusters (that's all the user wants to see in my case) and reference the 10
> L2 clusters nearest by centroid similarity using rowsimilarity to
> calculate. I'm hoping that this is a useful way to browse the information
> space.
>
> Naively speaking your streaming k seems to have elements of this built in.
>

Re: Canopy estimator

Posted by Pat Ferrel <pa...@occamsmachete.com>.
As I said in another reply, my data doesn't seem to cluster well, contrary 
to my intuition. I may have a data problem or need to do some 
dimensionality reduction. I've skimmed your docs and would love to give it 
a try. The 10x or 100x faster results alone would be a big help. It 
seems to promise doing away with the canopy step altogether, no? But I 
need to read your docs more carefully.

Ultimately I'm trying to find a good way to automatically generate 
several levels of clustering with varying "specificity". Each level 
might be completely independent of the others, but the clusters should be 
more and more specific. They could be hierarchical, but it might be 
better to just find the nearest clusters from less specific to more 
specific, going from a high number of doc members to a low number.

This is why canopy has been frustrating because by varying t I would have 
hoped to generate these levels of specificity, then replace hierarchical 
clustering with a similarity measure. In other words L1 has 1000 docs 
per cluster, L2 has 100 docs per cluster. I'd find the 100 docs closest 
to L1 clusters (that's all the user wants to see in my case) and 
reference the 10 L2 clusters nearest by centroid similarity using 
rowsimilarity to calculate. I'm hoping that this is a useful way to 
browse the information space.
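That linking step can be sketched with plain cosine similarity between centroids, as a stand-in for what rowsimilarity would compute (the toy centroids below are invented for illustration):

```python
import numpy as np

def nearest_fine_clusters(coarse, fine, k=2):
    """For each coarse (L1) centroid, return the indices of the k most
    cosine-similar fine (L2) centroids."""
    def unit(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = unit(coarse) @ unit(fine).T       # pairwise cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]  # top-k fine clusters per coarse one

coarse = np.array([[1.0, 0.0], [0.0, 1.0]])
fine = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]])
print(nearest_fine_clusters(coarse, fine))
```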

Naively speaking your streaming k seems to have elements of this built in.

On 5/11/12 8:26 AM, Ted Dunning wrote:
> The streaming k-means stuff might be an interesting alternative to setting
> parameters manually.  In that work, the algorithm adaptively sets a
> parameter that has similar function to T1 and T2.
>
> More importantly, the output of the main pass is a large number of weighted
> centroids that can be used as a small surrogate for the entire data set in
> subsequent clustering.  Since these centroids should fit in memory, you
> could do an adaptive search for propitious values of T1 and T2.
>
> My github repo has a description of this algorithm with an analysis of
> scaling properties.  See
>
> https://github.com/tdunning/knn
>
> As soon as we finish the cleanup release, I will start folding in this code.
>
On Fri, May 11, 2012 at 7:58 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>> The reason I use T1==T2 is that T2 is the only threshold that determines
>> the number of clusters. T1 affects how many adjacent points are considered
>> in the centroid calculations. So you could simplify your histogram analysis
>> to 2-d without affecting #clusters.
>>
>> Hierarchical clustering is one way to think about the clustering of
>> information that we have just recently added to Mahout. Any experiences you
>> can share with its application would be valuable.
>>
>> On 5/10/12 12:20 PM, Pat Ferrel wrote:
>>
>>> Naively, I imagine taking a range, dividing it into equal increments, and
>>> calculating all relevant cluster numbers. It would take on the order of (# of
>>> increments)**2  time to do but it seems to me that for a given corpus you
>>> wouldn't need to do this very often (actually you only need 1/2 this data).
>>> You would get a 3-d surface/histogram with magnitude = # of clusters, x and
>>> y = t1 and t2. Then search this data for local maxes, mins and inflection
>>> points. I'm not sure what this data would look like -- hence the "naively"
>>> disclaimer at the start. It is certainly a large landscape to search by
>>> hand.
>>>
>>> Your method only looks at the diagonal (t1==t2) and maybe that is the most
>>> interesting part, in which case the calculations are much quicker.
>>>
>>> Ultimately I'm interested in finding a better way to do hierarchical
>>> clustering. Information very often has a natural hierarchy but the usual
>>> methods produce spotty results. If we had a reasonable canopy estimator we
>>> could employ it at each level on the subset of the corpus being clustered.
>>> Doing this by hand quickly becomes prohibitive given that the number of
>>> times you have to estimate canopy values increases exponentially with each
>>> level of hierarchy.
>>>
>>> Even a mediocre estimator would likely be better than picking k out of
>>> the air. And the times it would fail to produce would also tell you
>>> something about your data.
>>>
>>> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>>>
>>>> No, the issue was discussed but never reached critical mass. I typically
>>>> do a binary search to find the best value setting T1==T2 and then tweak T1
>>>> up a bit. For feeding k-means, this latter step is not so important.
>>>>
>>>> If you could figure out a way to automate this we would be interested.
>>>> Conceptually, using the RandomSeedGenerator to sample a few vectors and
>>>> comparing them with your chosen DistanceMeasure would give you a hint at
>>>> the T-value to begin the search. A utility to do that would be a useful
>>>> contribution.
>>>>
>>>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>>>
>>>>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>>>>
>>>>> Did anything ever get done with this? Ted mentions limited usefulness.
>>>>> This may be true but the cases he mentions as counterexamples are also not
>>>>> very good for using canopy ahead of kmeans, no? That info would be a useful
>>>>> result. To use canopies I find myself running it over and over trying to
>>>>> see some inflection in the number of clusters. Why not automate this? Even
>>>>> if the data shows nothing, that is itself an answer of value and it would
>>>>> save a lot of hand work to find out the same thing.
>>>>>
>>>>>
>>>>>
>>>

Re: Canopy estimator

Posted by Ted Dunning <te...@gmail.com>.
The streaming k-means stuff might be an interesting alternative to setting
parameters manually.  In that work, the algorithm adaptively sets a
parameter that has similar function to T1 and T2.

More importantly, the output of the main pass is a large number of weighted
centroids that can be used as a small surrogate for the entire data set in
subsequent clustering.  Since these centroids should fit in memory, you
could do an adaptive search for propitious values of T1 and T2.
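As a caricature of that surrogate-building pass (a simplified sketch, not the actual streaming k-means algorithm; the fixed merge distance d stands in for the adaptively set parameter): fold each point into a nearby weighted centroid, leaving a small summary that fits in memory:

```python
import numpy as np

def streaming_pass(points, d):
    """One pass over the data: fold each point into the nearest existing
    centroid if it is within distance d, otherwise open a new weighted
    centroid."""
    centroids, weights = [], []
    for p in points:
        if centroids:
            dists = [np.linalg.norm(p - c) for c in centroids]
            i = int(np.argmin(dists))
            if dists[i] < d:
                w = weights[i]
                centroids[i] = (w * centroids[i] + p) / (w + 1)  # running mean
                weights[i] = w + 1
                continue
        centroids.append(p.astype(float))
        weights.append(1)
    return np.array(centroids), np.array(weights)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, (200, 2)),
                  rng.normal(5.0, 0.3, (200, 2))])
cents, ws = streaming_pass(data, 1.0)
print(len(cents), "weighted centroids summarize", int(ws.sum()), "points")
```

The surviving weighted centroids can then be clustered in memory with whatever method, and whatever T1/T2 search, you like.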

My github repo has a description of this algorithm with an analysis of
scaling properties.  See

https://github.com/tdunning/knn

As soon as we finish the cleanup release, I will start folding in this code.

On Fri, May 11, 2012 at 7:58 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> The reason I use T1==T2 is that T2 is the only threshold that determines
> the number of clusters. T1 affects how many adjacent points are considered
> in the centroid calculations. So you could simplify your histogram analysis
> to 2-d without affecting #clusters.
>
> Hierarchical clustering is one way to think about the clustering of
> information that we have just recently added to Mahout. Any experiences you
> can share with its application would be valuable.
>
> On 5/10/12 12:20 PM, Pat Ferrel wrote:
>
>> Naively, I imagine taking a range, dividing it into equal increments, and
>> calculating all relevant cluster numbers. It would take on the order of (# of
>> increments)**2  time to do but it seems to me that for a given corpus you
>> wouldn't need to do this very often (actually you only need 1/2 this data).
>> You would get a 3-d surface/histogram with magnitude = # of clusters, x and
>> y = t1 and t2. Then search this data for local maxes, mins and inflection
>> points. I'm not sure what this data would look like -- hence the "naively"
>> disclaimer at the start. It is certainly a large landscape to search by
>> hand.
>>
>> Your method only looks at the diagonal (t1==t2) and maybe that is the most
>> interesting part, in which case the calculations are much quicker.
>>
>> Ultimately I'm interested in finding a better way to do hierarchical
>> clustering. Information very often has a natural hierarchy but the usual
>> methods produce spotty results. If we had a reasonable canopy estimator we
>> could employ it at each level on the subset of the corpus being clustered.
>> Doing this by hand quickly becomes prohibitive given that the number of
>> times you have to estimate canopy values increases exponentially with each
>> level of hierarchy.
>>
>> Even a mediocre estimator would likely be better than picking k out of
>> the air. And the times it would fail to produce would also tell you
>> something about your data.
>>
>> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>>
>>> No, the issue was discussed but never reached critical mass. I typically
>>> do a binary search to find the best value setting T1==T2 and then tweak T1
>>> up a bit. For feeding k-means, this latter step is not so important.
>>>
>>> If you could figure out a way to automate this we would be interested.
>>> Conceptually, using the RandomSeedGenerator to sample a few vectors and
>>> comparing them with your chosen DistanceMeasure would give you a hint at
>>> the T-value to begin the search. A utility to do that would be a useful
>>> contribution.
>>>
>>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>>
>>>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>>>
>>>> Did anything ever get done with this? Ted mentions limited usefulness.
>>>> This may be true but the cases he mentions as counterexamples are also not
>>>> very good for using canopy ahead of kmeans, no? That info would be a useful
>>>> result. To use canopies I find myself running it over and over trying to
>>>> see some inflection in the number of clusters. Why not automate this? Even
>>>> if the data shows nothing, that is itself an answer of value and it would
>>>> save a lot of hand work to find out the same thing.
>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: RowSimilarity

Posted by Suneel Marthi <su...@yahoo.com>.
The consider() method in the similarity measure (Tanimoto in your scenario) is the one that does the cut-off.
Almost all of the similarity measures have some implementation of consider() to cut off the returned results.
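Mechanically, that amounts to a predicate applied to each candidate pair before truncating to the requested count. A hand-rolled sketch (a hypothetical stand-in for Mahout's internals, using cosine for concreteness):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar_rows(rows, query, threshold, max_results):
    """Rank rows by similarity to query, then apply a consider()-style
    threshold cutoff before truncating to max_results."""
    scored = sorted(((cosine_sim(query, r), i) for i, r in enumerate(rows)),
                    reverse=True)
    return [(i, s) for s, i in scored if s >= threshold][:max_results]

rows = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
print(similar_rows(rows, [1, 0, 0], threshold=0.5, max_results=10))
```

Asking for 10 here returns only the 2 rows that clear the threshold, mirroring the "asked for 10, got fewer" behavior.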

Have a look at Sebastian's explanation in https://issues.apache.org/jira/browse/MAHOUT-803.




________________________________
 From: Pat Ferrel <pa...@occamsmachete.com>
To: user@mahout.apache.org 
Sent: Saturday, May 12, 2012 7:29 PM
Subject: RowSimilarity
 
I tried an experiment running RowSimilarity with 16 docs of short quotations on a similar subject. It looks to me like, using Tanimoto, the largest pair-wise distance allowed for the similar docs was 0.4. Though I asked for 10 similar docs, I got 0 to 10. I see this same effect with larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance myself. In other words I was expecting to always get 20 similar docs when I asked for 20. It is useful to see what docs are at larger distances.

How is RowSimilarity deciding when to cut-off the returned docs?

Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Sorry, but I'm still confused. So the similarity magnitude has nothing to 
do with one of Mahout's distance measures; the similarity class is used 
only to specify the algorithm used to calculate this magnitude, and does 
not imply a connection between distance and similarity? I'm now a bit 
unsure about how to read my results.

  * Using Tanimoto, for example: is a value of 0.0001 more similar than a
    value of 0.9? This seems to fit my results, even though below you say
    "A lack of (term) cooccurrences is equivalent to a similarity of 0".
  * Is there a description somewhere of what the similarity magnitude
    describes?
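For the first bullet, a quick worked example (a hand-rolled Tanimoto using the standard extended-Jaccard formula, not Mahout's class) suggests the opposite reading: larger values mean more similar, with identical vectors scoring 1 and disjoint vectors scoring 0, so 0.9 would be far more similar than 0.0001:

```python
def tanimoto(a, b):
    """Tanimoto / extended Jaccard: dot / (|a|^2 + |b|^2 - dot).
    1.0 for identical vectors, 0.0 when nothing co-occurs."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    return dot / (na + nb - dot)

identical = tanimoto({0: 1.0, 1: 1.0}, {0: 1.0, 1: 1.0})                # 1.0
partial = tanimoto({0: 1.0, 1: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0, 3: 1.0})  # 0.5
disjoint = tanimoto({0: 1.0}, {5: 1.0})                                 # 0.0
print(identical, partial, disjoint)
```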

Thanks,
Pat

On 5/14/12 10:35 AM, Sebastian Schelter wrote:
> "The cutoff is based on a lack of term cooccurrences, not the distance
> measure."
>
> I'd rather use the term 'similarity measure', not 'distance measure', as a
> lot of the measures implemented are not metrics and the term 'distance'
> might be misleading.
>
> A lack of (term) cooccurrences is equivalent to a similarity of 0 by
> definition; therefore the "default cutoff" is also based on the
> similarity measure.
>
> --sebastian
>
>
> On 14.05.2012 19:30, Pat Ferrel wrote:
>> Thanks, this is quite clear and reasonable.  The optional
>> 'threshold' is based on the distance measure.
>>
>> BTW I assume the 'distance' returned is expressed in the distance
>> measure's units? So using cosine as a distance measure a value near 0 is
>> actually quite similar because the measure is 1-(cosine of the angle
>> between the vectors)?
>>
>> On 5/13/12 9:10 AM, Sebastian Schelter wrote:
>>> Hi Pat,
>>>
>>> RowSimilarityJob allows the use of a lot of different similarity
>>> measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
>>> of which compute a single number for a pair of vectors that denotes how
>>> similar those are. All these measures have the characteristic that two
>>> vectors that do not share at least one non-zero value in a single
>>> dimension are considered not similar (have similarity 0).
>>>
>>> In general, an all-pairs comparison, as it is conducted by
>>> RowSimilarityJob, has quadratic complexity and is therefore not scalable.
>>>
>>> If we have sparse data such as text or ratings however, we can exploit
>>> the fact that we only need to compare pairs which share at least one
>>> non-zero value in a dimension. This is the basic idea behind row
>>> similarity job to avoid an all-pairs comparison.
>>>
>>> In some real-world use cases you will furthermore encounter a lot of
>>> pairs with near-zero similarities that are of little value for you. To
>>> be able to avoid computing these, RowSimilarityJob provides the option
>>> to specify a minimum threshold so that it ignores pairs with a
>>> similarity value below this threshold. This threshold is data-dependent
>>> and you have to experimentally find it.
>>>
>>> --sebastian
>>>
>>>
>>> On 13.05.2012 17:33, Pat Ferrel wrote:
>>>> To paraphrase:
>>>>
>>>> There is some internal threshold to be considered 'similar'. This is the
>>>> one supplied with the 'threshold' option mentioned below and I need to
>>>> do a special build to get this option activated? I assume it is not
>>>> active because it has not been tested well?
>>>>
>>>> So currently how is the threshold calculated? How can I determine its
>>>> value? Can I vote that this be activated as an optional parameter in the
>>>> future?
>>>>
>>>> I ask this in part because I want to use RowSimilarity in an experiment
>>>> to do something like a non-partitioning hierarchical clustering where
>>>> I'll need to find close centroids in clusters calculated with different
>>>> levels of specificity.
>>>>
>>>> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>>>>> This could be simply due to the fact that there are fewer similar docs
>>>>> than the number specified in 'maxSimilaritiesPerRow'.
>>>>>
>>>>> consider() is only invoked if a threshold was specified.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>>>>     Pat's question was that he was seeing fewer documents than that
>>>>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>>>>> the 'consider' functionality of the applied similarity measure.
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>>     From: Sebastian Schelter<ss...@apache.org>
>>>>>> To: user@mahout.apache.org
>>>>>> Sent: Sunday, May 13, 2012 2:08 AM
>>>>>> Subject: Re: RowSimilarity
>>>>>>
>>>>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>>>>> similar docs/items/rows per row. It depends on your data if there are
>>>>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>>>>
>>>>>> The option 'threshold' determines the minimum similarity value for a
>>>>>> pair of docs (otherwise it will be dropped). This option is not
>>>>>> activated by default however.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>>>>> quotations on a similar subject. It looks to me that using
>>>>>>> tanimoto the
>>>>>>> largest pair-wise distance allowed for the similar docs was 0.4.
>>>>>>> Though
>>>>>>> I asked for 10 similar docs I got 0 to 10. I see this same effect
>>>>>>> with
>>>>>>> larger data sets but haven't seen an obvious cut-off point
>>>>>>>
>>>>>>> I was expecting to be able to make the decision about cut-off
>>>>>>> distance
>>>>>>> myself. In other words I was expecting to always get 20 similar docs
>>>>>>> when I asked for 20. It is useful to see what docs are at larger
>>>>>>> distances.
>>>>>>>
>>>>>>> How is RowSimilarity deciding when to cut-off the returned docs?
>>>>>>>
>>>
>
>

Re: RowSimilarity

Posted by Sebastian Schelter <ss...@googlemail.com>.
"The cutoff is based on a lack of term cooccurrences, not the distance
measure."

I'd rather use the term 'similarity measure', not 'distance measure', as a
lot of the measures implemented are not metrics and the term 'distance'
might be misleading.

A lack of (term) cooccurrences is equivalent to a similarity of 0 by
definition; therefore the "default cutoff" is also based on the
similarity measure.

--sebastian


On 14.05.2012 19:30, Pat Ferrel wrote:
> Thanks, this is quite clear and reasonable.  The optional
> 'threshold' is based on the distance measure.
> 
> BTW I assume the 'distance' returned is expressed in the distance
> measure's units? So using cosine as a distance measure a value near 0 is
> actually quite similar because the measure is 1-(cosine of the angle
> between the vectors)?
> 
> On 5/13/12 9:10 AM, Sebastian Schelter wrote:
>> Hi Pat,
>>
>> RowSimilarityJob allows the use of a lot of different similarity
>> measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
>> of which compute a single number for a pair of vectors that denotes how
>> similar those are. All these measures have the characteristic that two
>> vectors that do not share at least one non-zero value in a single
>> dimension are considered not similar (have similarity 0).
>>
>> In general, an all-pairs comparison, as it is conducted by
>> RowSimilarityJob, has quadratic complexity and is therefore not scalable.
>>
>> If we have sparse data such as text or ratings however, we can exploit
>> the fact that we only need to compare pairs which share at least one
>> non-zero value in a dimension. This is the basic idea behind row
>> similarity job to avoid an all-pairs comparison.
>>
>> In some real-world use cases you will furthermore encounter a lot of
>> pairs with near-zero similarities that are of little value for you. To
>> be able to avoid computing these, RowSimilarityJob provides the option
>> to specify a minimum threshold so that it ignores pairs with a
>> similarity value below this threshold. This threshold is data-dependent
>> and you have to experimentally find it.
>>
>> --sebastian
>>
>>
>> On 13.05.2012 17:33, Pat Ferrel wrote:
>>> To paraphrase:
>>>
>>> There is some internal threshold to be considered 'similar'. This is the
>>> one supplied with the 'threshold' option mentioned below and I need to
>>> do a special build to get this option activated? I assume it is not
>>> active because it has not been tested well?
>>>
>>> So currently how is the threshold calculated? How can I determine its
>>> value? Can I vote that this be activated as an optional parameter in the
>>> future?
>>>
>>> I ask this in part because I want to use RowSimilarity in an experiment
>>> to do something like a non-partitioning hierarchical clustering where
>>> I'll need to find close centroids in clusters calculated with different
>>> levels of specificity.
>>>
>>> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>>>> This could be simply due to the fact that there are fewer similar docs
>>>> than the number specified in 'maxSimilaritiesPerRow'.
>>>>
>>>> consider() is only invoked if a threshold was specified.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>
>>>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>>>    Pat's question was that he was seeing fewer documents than that
>>>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>>>> the 'consider' functionality of the applied similarity measure.
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>>    From: Sebastian Schelter<ss...@apache.org>
>>>>> To: user@mahout.apache.org
>>>>> Sent: Sunday, May 13, 2012 2:08 AM
>>>>> Subject: Re: RowSimilarity
>>>>>
>>>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>>>> similar docs/items/rows per row. It depends on your data if there are
>>>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>>>
>>>>> The option 'threshold' determines the minimum similarity value for a
>>>>> pair of docs (otherwise it will be dropped). This option is not
>>>>> activated by default however.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>>>> quotations on a similar subject. It looks to me that using
>>>>>> tanimoto the
>>>>>> largest pair-wise distance allowed for the similar docs was 0.4.
>>>>>> Though
>>>>>> I asked for 10 similar docs I got 0 to 10. I see this same effect
>>>>>> with
>>>>>> larger data sets but haven't seen an obvious cut-off point
>>>>>>
>>>>>> I was expecting to be able to make the decision about cut-off
>>>>>> distance
>>>>>> myself. In other words I was expecting to always get 20 similar docs
>>>>>> when I asked for 20. It is useful to see what docs are at larger
>>>>>> distances.
>>>>>>
>>>>>> How is RowSimilarity deciding when to cut-off the returned docs?
>>>>>>
>>>>
>>
>>


Re: RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Thanks, this is quite clear and reasonable. The cutoff is based on a lack 
of term cooccurrences, not the distance measure. The optional 
'threshold' is based on the distance measure.

BTW I assume the 'distance' returned is expressed in the distance 
measure's units? So using cosine as a distance measure a value near 0 is 
actually quite similar because the measure is 1-(cosine of the angle 
between the vectors)?
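A minimal check of that reading (assuming distance = 1 - cosine similarity):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a, b, c = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
print(1 - cosine_similarity(a, b))  # identical vectors -> distance 0.0
print(1 - cosine_similarity(a, c))  # orthogonal vectors -> distance 1.0
```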

On 5/13/12 9:10 AM, Sebastian Schelter wrote:
> Hi Pat,
>
> RowSimilarityJob allows the use of a lot of different similarity
> measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
> of which compute a single number for a pair of vectors that denotes how
> similar those are. All these measures have the characteristic that two
> vectors that do not share at least one non-zero value in a single
> dimension are considered not similar (have similarity 0).
>
> In general, an all-pairs comparison, as it is conducted by
> RowSimilarityJob, has quadratic complexity and is therefore not scalable.
>
> If we have sparse data such as text or ratings however, we can exploit
> the fact that we only need to compare pairs which share at least one
> non-zero value in a dimension. This is the basic idea behind row
> similarity job to avoid an all-pairs comparison.
>
> In some real-world use cases you will furthermore encounter a lot of
> pairs with near-zero similarities that are of little value for you. To
> be able to avoid computing these, RowSimilarityJob provides the option
> to specify a minimum threshold so that it ignores pairs with a
> similarity value below this threshold. This threshold is data-dependent
> and you have to experimentally find it.
>
> --sebastian
>
>
> On 13.05.2012 17:33, Pat Ferrel wrote:
>> To paraphrase:
>>
>> There is some internal threshold to be considered 'similar'. This is the
>> one supplied with the 'threshold' option mentioned below and I need to
>> do a special build to get this option activated? I assume it is not
>> active because it has not been tested well?
>>
>> So currently how is the threshold calculated? How can I determine its
>> value? Can I vote that this be activated as an optional parameter in the
>> future?
>>
>> I ask this in part because I want to use RowSimilarity in an experiment
>> to do something like a non-partitioning hierarchical clustering where
>> I'll need to find close centroids in clusters calculated with different
>> levels of specificity.
>>
>> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>>> This could be simply due to the fact that there are fewer similar docs
>>> than the number specified in 'maxSimilaritiesPerRow'.
>>>
>>> consider() is only invoked if a threshold was specified.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>>    Pat's question was that he was seeing fewer documents than that
>>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>>> the 'consider' functionality of the applied similarity measure.
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>    From: Sebastian Schelter<ss...@apache.org>
>>>> To: user@mahout.apache.org
>>>> Sent: Sunday, May 13, 2012 2:08 AM
>>>> Subject: Re: RowSimilarity
>>>>
>>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>>> similar docs/items/rows per row. It depends on your data if there are
>>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>>
>>>> The option 'threshold' determines the minimum similarity value for a
>>>> pair of docs (otherwise it will be dropped). This option is not
>>>> activated by default however.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>>> quotations on a similar subject. It looks to me that using tanimoto the
>>>>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>>>>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>>>>> larger data sets but haven't seen an obvious cut-off point
>>>>>
>>>>> I was expecting to be able to make the decision about cut-off distance
>>>>> myself. In other words I was expecting to always get 20 similar docs
>>>>> when I asked for 20. It is useful to see what docs are at larger
>>>>> distances.
>>>>>
>>>>> How is RowSimilarity deciding when to cut-off the returned docs?
>>>>>
>>>
>
>

Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Pat,

RowSimilarityJob allows the use of a lot of different similarity
measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
of which compute a single number for a pair of vectors that denotes how
similar those are. All these measures have the characteristic that two
vectors that do not share at least one non-zero value in a single
dimension are considered not similar (have similarity 0).

In general, an all-pairs comparison, as it is conducted by
RowSimilarityJob, has quadratic complexity and is therefore not scalable.

If we have sparse data such as text or ratings however, we can exploit
the fact that we only need to compare pairs which share at least one
non-zero value in a dimension. This is the basic idea behind row
similarity job to avoid an all-pairs comparison.

In some real-world use cases you will furthermore encounter a lot of
pairs with near-zero similarities that are of little value for you. To
be able to avoid computing these, RowSimilarityJob provides the option
to specify a minimum threshold so that it ignores pairs with a
similarity value below this threshold. This threshold is data-dependent
and you have to experimentally find it.
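The sparse trick can be sketched with an inverted index: only pairs that co-occur in some dimension are ever generated, so non-cooccurring pairs get similarity 0 for free (a toy version, not RowSimilarityJob's MapReduce implementation):

```python
from collections import defaultdict
from itertools import combinations

def sparse_pair_dots(rows):
    """Accumulate dot products only for row pairs that co-occur in at
    least one dimension, via an inverted index."""
    index = defaultdict(list)            # dimension -> [(row_id, value)]
    for rid, row in enumerate(rows):
        for dim, val in row.items():
            index[dim].append((rid, val))
    dots = defaultdict(float)
    for entries in index.values():
        for (i, vi), (j, vj) in combinations(entries, 2):
            dots[(i, j)] += vi * vj      # contribution of one shared dimension
    return dict(dots)

# rows as sparse {dimension: value} dicts; row 2 shares nothing with the others
rows = [{0: 1.0, 1: 2.0}, {1: 3.0}, {7: 1.0}]
print(sparse_pair_dots(rows))   # only the co-occurring pair (0, 1) appears
```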

--sebastian


On 13.05.2012 17:33, Pat Ferrel wrote:
> To paraphrase:
> 
> There is some internal threshold to be considered 'similar'. This is the
> one supplied with the 'threshold' option mentioned below and I need to
> do a special build to get this option activated? I assume it is not
> active because it has not been tested well?
> 
> So currently how is the threshold calculated? How can I determine its
> value? Can I vote that this be activated as an optional parameter in the
> future?
> 
> I ask this in part because I want to use RowSimilarity in an experiment
> to do something like a non-partitioning hierarchical clustering where
> I'll need to find close centroids in clusters calculated with different
> levels of specificity.
> 
> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>> This could be simply due to the fact that there are fewer similar docs
>> than the number specified in 'maxSimilaritiesPerRow'.
>>
>> consider() is only invoked if a threshold was specified.
>>
>> Best,
>> Sebastian
>>
>>
>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>   Pat's question was that he was seeing fewer documents than that
>>> specified by 'maxSimilaritiesPerRow', this could be happening due to
>>> the 'consider' functionality of the applied similarity measure.
>>>
>>>
>>>
>>> ________________________________
>>>   From: Sebastian Schelter<ss...@apache.org>
>>> To: user@mahout.apache.org
>>> Sent: Sunday, May 13, 2012 2:08 AM
>>> Subject: Re: RowSimilarity
>>>
>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>> similar docs/items/rows per row. It depends on your data if there are
>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>
>>> The option 'threshold' determines the minimum similarity value for a
>>> pair of docs (otherwise it will be dropped). This option is not
>>> activated by default however.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>> quotations on a similar subject. It looks to me that using tanimoto the
>>>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>>>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>>>> larger data sets but haven't seen an obvious cut-off point
>>>>
>>>> I was expecting to be able to make the decision about cut-off distance
>>>> myself. In other words I was expecting to always get 20 similar docs
>>>> when I asked for 20. It is useful to see what docs are at larger
>>>> distances.
>>>>
>>>> How is RowSimilarity deciding when to cut-off the returned docs?
>>>>
>>
>>


Re: RowSimilarity

Posted by Pat Ferrel <pa...@farfetchers.com>.
To paraphrase:

There is some internal threshold for being considered 'similar'. This is 
the one supplied with the 'threshold' option mentioned below, and I need 
to do a special build to get this option activated? I assume it is not 
active because it has not been well tested?

So currently how is the threshold calculated? How can I determine its 
value? Can I vote that this be activated as an optional parameter in the 
future?

I ask this in part because I want to use RowSimilarity in an experiment 
to do something like a non-partitioning hierarchical clustering where 
I'll need to find close centroids in clusters calculated with different 
levels of specificity.

On 5/12/12 11:38 PM, Sebastian Schelter wrote:
> This could be simply due to the fact that there are less similar docs
> than the number specified in 'maxSimilaritiesPerRow'.
>
> consider() is only invoked if a threshold was specified.
>
> Best,
> Sebastian
>
>
> On 13.05.2012 08:25, Suneel Marthi wrote:
>>   Pat's question was that he was seeing less documents than that specified by 'maxSimilaritiesPerRow', this could be happening due to the 'consider' functionality of the applied similarity measure.
>>
>>
>>
>> ________________________________
>>   From: Sebastian Schelter<ss...@apache.org>
>> To: user@mahout.apache.org
>> Sent: Sunday, May 13, 2012 2:08 AM
>> Subject: Re: RowSimilarity
>>
>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>> similar docs/items/rows per row. It depends on your data if there are
>> enough similar rows per row, so you can't always get 20 similar docs.
>>
>> The option 'threshold' determines the minimum similarity value for a
>> pair of docs (otherwise it will be dropped). This option is not
>> activated by default however.
>>
>> Best,
>> Sebastian
>>
>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>> I tried an experiment running RowSimilarity with 16 docs of short
>>> quotations on a similar subject. It looks to me that using tanimoto the
>>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>>> larger data sets but haven't seen an obvious cut-off point
>>>
>>> I was expecting to be able to make the decision about cut-off distance
>>> myself. In other words I was expecting to always get 20 similar docs
>>> when I asked for 20. It is useful to see what docs are at larger distances.
>>>
>>> How is RowSimilarity deciding when to cut-off the returned docs?
>>>
>
>

Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
This could be simply due to the fact that there are fewer similar docs
than the number specified in 'maxSimilaritiesPerRow'.

consider() is only invoked if a threshold was specified.

Best,
Sebastian


On 13.05.2012 08:25, Suneel Marthi wrote:
>  Pat's question was that he was seeing less documents than that specified by 'maxSimilaritiesPerRow', this could be happening due to the 'consider' functionality of the applied similarity measure.
> 
> 
> 
> ________________________________
>  From: Sebastian Schelter <ss...@apache.org>
> To: user@mahout.apache.org 
> Sent: Sunday, May 13, 2012 2:08 AM
> Subject: Re: RowSimilarity
>  
> The option 'maxSimilaritiesPerRow' determines the maximum number of
> similar docs/items/rows per row. It depends on your data if there are
> enough similar rows per row, so you can't always get 20 similar docs.
> 
> The option 'threshold' determines the minimum similarity value for a
> pair of docs (otherwise it will be dropped). This option is not
> activated by default however.
> 
> Best,
> Sebastian
> 
> On 13.05.2012 01:29, Pat Ferrel wrote:
>> I tried an experiment running RowSimilarity with 16 docs of short
>> quotations on a similar subject. It looks to me that using tanimoto the
>> largest pair-wise distance allowed for the similar docs was 0.4. Though
>> I asked for 10 similar docs I got 0 to 10. I see this same effect with
>> larger data sets but haven't seen an obvious cut-off point
>>
>> I was expecting to be able to make the decision about cut-off distance
>> myself. In other words I was expecting to always get 20 similar docs
>> when I asked for 20. It is useful to see what docs are at larger distances.
>>
>> How is RowSimilarity deciding when to cut-off the returned docs?
>>


Re: RowSimilarity

Posted by Suneel Marthi <su...@yahoo.com>.
 Pat's question was that he was seeing fewer documents than the number specified by 'maxSimilaritiesPerRow'; this could be happening due to the 'consider' functionality of the applied similarity measure.



________________________________
 From: Sebastian Schelter <ss...@apache.org>
To: user@mahout.apache.org 
Sent: Sunday, May 13, 2012 2:08 AM
Subject: Re: RowSimilarity
 
The option 'maxSimilaritiesPerRow' determines the maximum number of
similar docs/items/rows per row. It depends on your data if there are
enough similar rows per row, so you can't always get 20 similar docs.

The option 'threshold' determines the minimum similarity value for a
pair of docs (otherwise it will be dropped). This option is not
activated by default however.

Best,
Sebastian

On 13.05.2012 01:29, Pat Ferrel wrote:
> I tried an experiment running RowSimilarity with 16 docs of short
> quotations on a similar subject. It looks to me that using tanimoto the
> largest pair-wise distance allowed for the similar docs was 0.4. Though
> I asked for 10 similar docs I got 0 to 10. I see this same effect with
> larger data sets but haven't seen an obvious cut-off point
> 
> I was expecting to be able to make the decision about cut-off distance
> myself. In other words I was expecting to always get 20 similar docs
> when I asked for 20. It is useful to see what docs are at larger distances.
> 
> How is RowSimilarity deciding when to cut-off the returned docs?
> 

Re: RowSimilarity

Posted by Sebastian Schelter <ss...@apache.org>.
The option 'maxSimilaritiesPerRow' determines the maximum number of
similar docs/items/rows per row. Whether there are enough similar rows
per row depends on your data, so you can't always get 20 similar docs.

The option 'threshold' determines the minimum similarity value for a
pair of docs (otherwise it will be dropped). This option is not
activated by default however.
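
To make the interaction concrete, here is a toy pure-Python sketch of
those two semantics (illustrative only, not Mahout's implementation;
the function names and the Tanimoto-on-term-sets setup are mine):

```python
from itertools import combinations

def tanimoto(a, b):
    # Tanimoto (Jaccard) similarity between two sets of term ids
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def similar_rows(docs, max_per_row=10, threshold=None):
    # max_per_row caps the list per row; pairs below the optional
    # threshold are dropped before the top-k cut, so a row can come
    # back with fewer than max_per_row entries
    sims = {i: [] for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        s = tanimoto(docs[i], docs[j])
        if threshold is not None and s < threshold:
            continue  # pair dropped entirely, not just ranked lower
        sims[i].append((s, j))
        sims[j].append((s, i))
    return {i: sorted(pairs, reverse=True)[:max_per_row]
            for i, pairs in sims.items()}

docs = [{"ball", "game", "team"},
        {"ball", "team", "score"},
        {"tax", "law"}]
capped = similar_rows(docs, max_per_row=2)             # every pair kept
filtered = similar_rows(docs, max_per_row=2, threshold=0.4)
```

Without a threshold, every row gets its two nearest rows even at
similarity 0; with threshold=0.4 the unrelated doc comes back with an
empty list, which matches the behavior Pat observed.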

Best,
Sebastian

On 13.05.2012 01:29, Pat Ferrel wrote:
> I tried an experiment running RowSimilarity with 16 docs of short
> quotations on a similar subject. It looks to me that using tanimoto the
> largest pair-wise distance allowed for the similar docs was 0.4. Though
> I asked for 10 similar docs I got 0 to 10. I see this same effect with
> larger data sets but haven't seen an obvious cut-off point
> 
> I was expecting to be able to make the decision about cut-off distance
> myself. In other words I was expecting to always get 20 similar docs
> when I asked for 20. It is useful to see what docs are at larger distances.
> 
> How is RowSimilarity deciding when to cut-off the returned docs?
> 


RowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I tried an experiment running RowSimilarity with 16 docs of short 
quotations on a similar subject. It looks to me that, using Tanimoto, the 
largest pair-wise distance allowed for the similar docs was 0.4. Though 
I asked for 10 similar docs I got 0 to 10. I see this same effect with 
larger data sets but haven't seen an obvious cut-off point.

I was expecting to be able to make the decision about cut-off distance 
myself. In other words I was expecting to always get 20 similar docs 
when I asked for 20. It is useful to see what docs are at larger distances.

How is RowSimilarity deciding when to cut-off the returned docs?


Re: Canopy estimator

Posted by Ted Dunning <te...@gmail.com>.
Yes.  It may help with variable scale.

The classic technique for dealing with that is to cluster with a small number
of clusters at a gross level and then cluster each set of documents that
belong to a single large cluster.   This automatically adapts to different
scales.

The new stuff would greatly facilitate your experimentation.

On Sat, May 12, 2012 at 11:19 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> If you are asking about using your post 0.7 clustering, no I haven't yet.
> Will it help with varying scale? I assume by scale you mean the density of
> docs in certain areas of the  vector space? One thing I am trying now is
> limiting the subject matter crawled and getting a much larger sample, which
> should get me a denser distribution.

Re: Canopy estimator

Posted by Pat Ferrel <pa...@occamsmachete.com>.
If I understand your comment correctly this is why I hope that applying 
levels of specificity will help. On a particular subject L1 will give 
good quality and on another L2 will be better. I may be able to use an 
estimate of quality here to prune out bad clusters, not sure. The nature 
of my problem gives me no control over the input data in production so I 
have to come up with methods that are adaptive.

If you are asking about using your post 0.7 clustering, no I haven't 
yet. Will it help with varying scale? I assume by scale you mean the 
density of docs in certain areas of the  vector space? One thing I am 
trying now is limiting the subject matter crawled and getting a much 
larger sample, which should get me a denser distribution.

If you think it might help, do I build it inside the 0.7 snapshot? Is it 
a drop-in replacement for kmeans?

On 5/12/12 10:33 AM, Ted Dunning wrote:
> One thing that may be happening here is that the scale of your data varies
> from place to place.
>
> Have you tried the upcoming k-means stuff?
>
> On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel<pa...@farfetchers.com>  wrote:
>
>> One problem I have is that virtually any value for T gives me a very large
>> number of canopies--on the order of 2-5 docs per cluster. Whether I create
>> clusters using random seeds or canopies they are of poor quality to my eye.
>> A few are good but many are silly. I've tried a wide range of vectorizing
>> knobs including L2 norm, n-grams with a high ml, and doing a cutom lucene
>> filter to filer out numbers and do stemming to little avail. Using your
>> method of t1==t2 - get 2 docs per cluster with t=0.3 (tanimoto or cosine)
>> and 5 docs per cluster with t = 0.95. This is telling me that the docs are
>> not really clusterable contrary to

Re: Canopy estimator

Posted by Ted Dunning <te...@gmail.com>.
One thing that may be happening here is that the scale of your data varies
from place to place.

Have you tried the upcoming k-means stuff?

On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <pa...@farfetchers.com> wrote:

> One problem I have is that virtually any value for T gives me a very large
> number of canopies--on the order of 2-5 docs per cluster. Whether I create
> clusters using random seeds or canopies they are of poor quality to my eye.
> A few are good but many are silly. I've tried a wide range of vectorizing
> knobs including L2 norm, n-grams with a high ml, and doing a cutom lucene
> filter to filer out numbers and do stemming to little avail. Using your
> method of t1==t2 - get 2 docs per cluster with t=0.3 (tanimoto or cosine)
> and 5 docs per cluster with t = 0.95. This is telling me that the docs are
> not really clusterable contrary to

Re: Canopy estimator

Posted by Pat Ferrel <pa...@farfetchers.com>.
Wrote a shell script to do t1==t2 over a range and it does give useful 
information. Picking a few points outside of t1==t2 doesn't seem to 
affect things much, number-of-clusters-wise. Since there is really no 
way to talk about canopy quality AFAIK, the number is how I make a decision.

One problem I have is that virtually any value for T gives me a very 
large number of canopies--on the order of 2-5 docs per cluster. Whether 
I create clusters using random seeds or canopies they are of poor 
quality to my eye. A few are good but many are silly. I've tried a wide 
range of vectorizing knobs including L2 norm, n-grams with a high ml, 
and a custom Lucene filter to filter out numbers and do stemming, to 
little avail. Using your method of t1==t2, I get 2 docs per cluster with 
t=0.3 (tanimoto or cosine) and 5 docs per cluster with t=0.95. This is 
telling me that the docs are not really clusterable, contrary to intuition.

Next stop SVD? Maybe a larger data set from fewer sources will help?

As to hierarchical clustering in my case it makes little sense when 
canopies gives 2-5 docs per cluster. My experimental data set is web 
crawled news since it has a clear hierarchy, you can easily see it in 
categories like root:sports:baseball, soccer, basketball, etc.

As to hierarchical clustering: using another tool set where we had a 
proprietary patented algorithm for picking k, it worked pretty well. It 
was for email though, so the data was not very noisy. What I was hoping 
to do is use canopy or another method to estimate cluster numbers 
automatically for each level, and if I can get a crude canopy estimator 
working I'll report back.

On 5/11/12 7:58 AM, Jeff Eastman wrote:
> The reason I use T1==T2 is that T2 is the only threshold that 
> determines the number of clusters. T1 affects how many adjacent points 
> are considered in the centroid calculations. So you could simplify 
> your histogram analysis to 2-d without affecting #clusters.
>
> Hierarchical clustering is one way to think about the clustering of 
> information that we have just recently added to Mahout. Any 
> experiences you can share with its application would be valuable.
>
> On 5/10/12 12:20 PM, Pat Ferrel wrote:
>> Naively I imagine giving a range, divide up into equal increments and 
>> calculate all relevant cluster numbers. It would take the order of (# 
>> of increments)**2  time to do but it seems to me that for a given 
>> corpus you wouldn't need to do this very often (actually you only 
>> need 1/2 this data). You would get a 3-d surface/histogram with 
>> magnitude = # of clusters, x and y = t1 and t2. Then search this data 
>> for local maxes, mins and inflection points. I'm not sure what this 
>> data would look like -- hence the "naively" disclaimer at the start. 
>> It is certainly a large landscape to search by hand.
>>
>> Your method only looks at the diagonal (t1==t2)and maybe that is the 
>> most interesting part, in which case the calculations are much quicker.
>>
>> Ultimately I'm interested in finding a better way to do hierarchical 
>> clustering. Information very often has a natural hierarchy but the 
>> usual methods produce spotty results. If we had a reasonable canopy 
>> estimator we could employ it at each level on the subset of the 
>> corpus being clustered. Doing this by hand quickly becomes 
>> prohibitive given that the number of times you have to estimate 
>> canopy values increases exponentially with each level of hierarchy
>>
>> Even a mediocre estimator would likely be better that picking k out 
>> of the air. And the times it would fail to produce would also tell 
>> you something about your data.
>>
>> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>>> No, the issue was discussed but never reached critical mass. I 
>>> typically do a binary search to find the best value setting T1==T2 
>>> and then tweak T1 up a bit. For feeding k-means, this latter step is 
>>> not so important.
>>>
>>> If you could figure out a way to automate this we would be 
>>> interested. Conceptually, using the RandomSeedGenerator to sample a 
>>> few vectors and comparing them with your chosen DistanceMeasure 
>>> would give you a hint at the T-value to begin the search. A utility 
>>> to do that would be a useful contribution.
>>>
>>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>>>
>>>> Did anything ever get done with this? Ted mentions limited 
>>>> usefulness. This may be true but the cases he mentions as counter 
>>>> examples are also not very good for using canopy ahead of kmeans, 
>>>> no? That info would be a useful result. To use canopies I find 
>>>> myself running it over and over trying to see some inflection in 
>>>> the number of clusters. Why not automate this? Even if the data 
>>>> shows nothing, that is itself an answer of value and it would save 
>>>> a lot of hand work to find out the same thing.
>>>>
>>>>
>>>
>>
>>
>

Re: Canopy estimator

Posted by gaurav redkar <ga...@gmail.com>.
I have tried out a naive method to estimate the values for t1 and t2 to be
used for meanshift clustering. Any suggestions or comments about
shortcomings of my approach are welcome.

i take a sample of the dataset and compute pairwise similarity between all
the points in the sample using the same distance measure that i will use
while performing clustering (euclidean distance in my case). For simplicity
of explanation, assume i take 3 points from my dataset.
Computing similarity among these points gives me a similarity matrix as shown:

   0     1.56  1.40
   1.56  0     1.36
   1.40  1.36  0

Now looking at each column (or row, because the matrix is symmetric) you
can see that the largest element in the column (or row) is the distance
which, if used as t1, will cover all the points from the chosen sample
(and probably cover a sizeable percentage of the entire dataset). That
would result in all the points merging into one or a few clusters.

In order to choose t1:
I take the mean of all the elements in each column (ignoring the 0's on
the diagonal). For the matrix shown above, we get the values
   1.48  1.46  1.38

Then I take the average of these values, i.e. t1 = 1.44.

In order to choose t2:
I find the minimum element in each column (ignoring the 0's on the
diagonal), which gives me
   1.40  1.36  1.36

Then I take the mean of these minima, i.e. t2 = 1.37.

Any comments on the approach?
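
The scheme can be sketched in a few lines of pure Python (a toy version
for illustration only; the function name is mine and it assumes a small
in-memory sample of the pairwise-distance matrix):

```python
def estimate_t1_t2(dist):
    """Estimate canopy thresholds from a symmetric pairwise-distance
    matrix over a sample: t1 = mean of the per-column means and
    t2 = mean of the per-column minima (diagonal zeros excluded)."""
    n = len(dist)
    col_means, col_mins = [], []
    for j in range(n):
        col = [dist[i][j] for i in range(n) if i != j]
        col_means.append(sum(col) / len(col))
        col_mins.append(min(col))
    return sum(col_means) / n, sum(col_mins) / n

# The 3-point example from above
dist = [[0.00, 1.56, 1.40],
        [1.56, 0.00, 1.36],
        [1.40, 1.36, 0.00]]
t1, t2 = estimate_t1_t2(dist)   # t1 = 1.44, t2 ~= 1.37
```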

Thanks
Gaurav



On Fri, May 11, 2012 at 8:28 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> The reason I use T1==T2 is that T2 is the only threshold that determines
> the number of clusters. T1 affects how many adjacent points are considered
> in the centroid calculations. So you could simplify your histogram analysis
> to 2-d without affecting #clusters.
>
> Hierarchical clustering is one way to think about the clustering of
> information that we have just recently added to Mahout. Any experiences you
> can share with its application would be valuable.
>
>
> On 5/10/12 12:20 PM, Pat Ferrel wrote:
>
>> Naively I imagine giving a range, divide up into equal increments and
>> calculate all relevant cluster numbers. It would take the order of (# of
>> increments)**2  time to do but it seems to me that for a given corpus you
>> wouldn't need to do this very often (actually you only need 1/2 this data).
>> You would get a 3-d surface/histogram with magnitude = # of clusters, x and
>> y = t1 and t2. Then search this data for local maxes, mins and inflection
>> points. I'm not sure what this data would look like -- hence the "naively"
>> disclaimer at the start. It is certainly a large landscape to search by
>> hand.
>>
>> Your method only looks at the diagonal (t1==t2)and maybe that is the most
>> interesting part, in which case the calculations are much quicker.
>>
>> Ultimately I'm interested in finding a better way to do hierarchical
>> clustering. Information very often has a natural hierarchy but the usual
>> methods produce spotty results. If we had a reasonable canopy estimator we
>> could employ it at each level on the subset of the corpus being clustered.
>> Doing this by hand quickly becomes prohibitive given that the number of
>> times you have to estimate canopy values increases exponentially with each
>> level of hierarchy
>>
>> Even a mediocre estimator would likely be better that picking k out of
>> the air. And the times it would fail to produce would also tell you
>> something about your data.
>>
>> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>>
>>> No, the issue was discussed but never reached critical mass. I typically
>>> do a binary search to find the best value setting T1==T2 and then tweak T1
>>> up a bit. For feeding k-means, this latter step is not so important.
>>>
>>> If you could figure out a way to automate this we would be interested.
>>> Conceptually, using the RandomSeedGenerator to sample a few vectors and
>>> comparing them with your chosen DistanceMeasure would give you a hint at
>>> the T-value to begin the search. A utility to do that would be a useful
>>> contribution.
>>>
>>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>>
>>>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>>>
>>>> Did anything ever get done with this? Ted mentions limited usefulness.
>>>> This may be true but the cases he mentions as counter examples are also not
>>>> very good for using canopy ahead of kmeans, no? That info would be a useful
>>>> result. To use canopies I find myself running it over and over trying to
>>>> see some inflection in the number of clusters. Why not automate this? Even
>>>> if the data shows nothing, that is itself an answer of value and it would
>>>> save a lot of hand work to find out the same thing.
>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: Canopy estimator

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
The reason I use T1==T2 is that T2 is the only threshold that determines 
the number of clusters. T1 affects how many adjacent points are 
considered in the centroid calculations. So you could simplify your 
histogram analysis to 2-d without affecting #clusters.

Hierarchical clustering is one way to think about the clustering of 
information that we have just recently added to Mahout. Any experiences 
you can share with its application would be valuable.

On 5/10/12 12:20 PM, Pat Ferrel wrote:
> Naively I imagine giving a range, divide up into equal increments and 
> calculate all relevant cluster numbers. It would take the order of (# 
> of increments)**2  time to do but it seems to me that for a given 
> corpus you wouldn't need to do this very often (actually you only need 
> 1/2 this data). You would get a 3-d surface/histogram with magnitude = 
> # of clusters, x and y = t1 and t2. Then search this data for local 
> maxes, mins and inflection points. I'm not sure what this data would 
> look like -- hence the "naively" disclaimer at the start. It is 
> certainly a large landscape to search by hand.
>
> Your method only looks at the diagonal (t1==t2)and maybe that is the 
> most interesting part, in which case the calculations are much quicker.
>
> Ultimately I'm interested in finding a better way to do hierarchical 
> clustering. Information very often has a natural hierarchy but the 
> usual methods produce spotty results. If we had a reasonable canopy 
> estimator we could employ it at each level on the subset of the corpus 
> being clustered. Doing this by hand quickly becomes prohibitive given 
> that the number of times you have to estimate canopy values increases 
> exponentially with each level of hierarchy
>
> Even a mediocre estimator would likely be better that picking k out of 
> the air. And the times it would fail to produce would also tell you 
> something about your data.
>
> On 5/10/12 6:12 AM, Jeff Eastman wrote:
>> No, the issue was discussed but never reached critical mass. I 
>> typically do a binary search to find the best value setting T1==T2 
>> and then tweak T1 up a bit. For feeding k-means, this latter step is 
>> not so important.
>>
>> If you could figure out a way to automate this we would be 
>> interested. Conceptually, using the RandomSeedGenerator to sample a 
>> few vectors and comparing them with your chosen DistanceMeasure would 
>> give you a hint at the T-value to begin the search. A utility to do 
>> that would be a useful contribution.
>>
>> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>>
>>> Did anything ever get done with this? Ted mentions limited 
>>> usefulness. This may be true but the cases he mentions as counter 
>>> examples are also not very good for using canopy ahead of kmeans, 
>>> no? That info would be a useful result. To use canopies I find 
>>> myself running it over and over trying to see some inflection in the 
>>> number of clusters. Why not automate this? Even if the data shows 
>>> nothing, that is itself an answer of value and it would save a lot 
>>> of hand work to find out the same thing.
>>>
>>>
>>
>
>


Re: Canopy estimator

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Naively I imagine giving a range, divide up into equal increments and 
calculate all relevant cluster numbers. It would take the order of (# of 
increments)**2  time to do but it seems to me that for a given corpus 
you wouldn't need to do this very often (actually you only need 1/2 this 
data). You would get a 3-d surface/histogram with magnitude = # of 
clusters, x and y = t1 and t2. Then search this data for local maxes, 
mins and inflection points. I'm not sure what this data would look like 
-- hence the "naively" disclaimer at the start. It is certainly a large 
landscape to search by hand.

Your method only looks at the diagonal (t1==t2) and maybe that is the 
most interesting part, in which case the calculations are much quicker.
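
Collapsed to that diagonal, the sweep is easy to sketch (toy in-memory
Python with made-up names, standing in for the Hadoop canopy driver):

```python
def canopy_count(points, t2, dist):
    """One canopy pass: greedily pick a center, then drop every point
    within t2 of it; repeat until no points remain. Only t2 affects
    the number of canopies (t1 only widens centroid membership), so
    sweeping the t1==t2 diagonal is enough to count clusters."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(0)
        count += 1
        remaining = [p for p in remaining if dist(center, p) > t2]
    return count

def sweep(points, ts, dist):
    # The grid idea collapsed to the diagonal: canopy count per t
    return {t: canopy_count(points, t, dist) for t in ts}

euclid = lambda a, b: abs(a - b)
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
counts = sweep(pts, [0.05, 0.5, 6.0], euclid)   # {0.05: 6, 0.5: 3, 6.0: 2}
```

Plotting counts against t and looking for a plateau or inflection is
exactly the hand process this would automate.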

Ultimately I'm interested in finding a better way to do hierarchical 
clustering. Information very often has a natural hierarchy but the usual 
methods produce spotty results. If we had a reasonable canopy estimator 
we could employ it at each level on the subset of the corpus being 
clustered. Doing this by hand quickly becomes prohibitive given that the 
number of times you have to estimate canopy values increases 
exponentially with each level of hierarchy.

Even a mediocre estimator would likely be better than picking k out of 
the air. And the times it would fail to produce would also tell you 
something about your data.

On 5/10/12 6:12 AM, Jeff Eastman wrote:
> No, the issue was discussed but never reached critical mass. I 
> typically do a binary search to find the best value setting T1==T2 and 
> then tweak T1 up a bit. For feeding k-means, this latter step is not 
> so important.
>
> If you could figure out a way to automate this we would be interested. 
> Conceptually, using the RandomSeedGenerator to sample a few vectors 
> and comparing them with your chosen DistanceMeasure would give you a 
> hint at the T-value to begin the search. A utility to do that would be 
> a useful contribution.
>
> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> Did anything ever get done with this? Ted mentions limited 
>> usefulness. This may be true but the cases he mentions as counter 
>> examples are also not very good for using canopy ahead of kmeans, no? 
>> That info would be a useful result. To use canopies I find myself 
>> running it over and over trying to see some inflection in the number 
>> of clusters. Why not automate this? Even if the data shows nothing, 
>> that is itself an answer of value and it would save a lot of hand 
>> work to find out the same thing.
>>
>>
>

Re: Canopy estimator

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
No, the issue was discussed but never reached critical mass. I typically 
do a binary search to find the best value setting T1==T2 and then tweak 
T1 up a bit. For feeding k-means, this latter step is not so important.

If you could figure out a way to automate this we would be interested. 
Conceptually, using the RandomSeedGenerator to sample a few vectors and 
comparing them with your chosen DistanceMeasure would give you a hint at 
the T-value to begin the search. A utility to do that would be a useful 
contribution.
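
That suggested utility might look something like this (plain Python as
a stand-in for RandomSeedGenerator plus a DistanceMeasure; the names
and the choice of the median as the summary statistic are mine):

```python
import random

def t_hint(vectors, distance, sample_size=10, seed=42):
    """Sample a few vectors, compute all pairwise distances with the
    chosen measure, and return the median as a starting T for a
    binary search over T1 == T2."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = sorted(distance(a, b)
                   for i, a in enumerate(sample)
                   for b in sample[i + 1:])   # each unordered pair once
    return dists[len(dists) // 2]             # median pairwise distance

# Usage with a trivial 1-d "measure"
hint = t_hint(list(map(float, range(10))), lambda a, b: abs(a - b))
```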

On 5/9/12 8:36 PM, Pat Ferrel wrote:
> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>
> Did anything ever get done with this? Ted mentions limited usefulness. 
> This may be true but the cases he mentions as counter examples are 
> also not very good for using canopy ahead of kmeans, no? That info 
> would be a useful result. To use canopies I find myself running it 
> over and over trying to see some inflection in the number of clusters. 
> Why not automate this? Even if the data shows nothing, that is itself 
> an answer of value and it would save a lot of hand work to find out 
> the same thing.
>
>