Posted to user@mahout.apache.org by Szymon Chojnacki <sa...@o2.pl> on 2011/02/28 20:55:13 UTC

T1 and T2 in Canopy

Hello,

I am working with my colleague Tim on the MAHOUT-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache mail archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that would give a non-trivial set of clusters (more than 1 and fewer than the number of vectors) and would produce the result within, say, 3 hours.

I would be grateful for your advice, as the only way I could do it was by breaking the rule from the wiki that T1>T2. The problem is that if T1 is large, then we get many non-empty coordinates in each canopy, and both memory and CPU demand grow. However, setting a low T1 forces a low T2, which leads to a large number of canopies, and the same problem with memory and CPU.

My understanding of the source code is that T1 and T2 are independent, so I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after 40 minutes.

Thank you in advance for your suggestions on setting T1 and T2, and on the importance of the T1>T2 constraint.

Kind regards 
Szymon 

ps.
I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf. 

-- 
Szymon Chojnacki
http://www.ipipan.eu/~sch/
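
For readers following the T1/T2 discussion, here is a minimal Python sketch of a single-pass canopy assignment. It assumes the reading of the thresholds given in this thread: a point closer than T1 to a canopy center contributes to that canopy, and a point closer than T2 to any center does not start a new canopy. The function names, the Euclidean distance, and the data layout are illustrative assumptions, not Mahout code.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_cluster(points, t1, t2, distance=euclidean):
    """Single pass over the points; returns a list of canopies.

    Each canopy is a dict with a fixed 'center' (the point that created it)
    and the list of 'members' that contribute to its centroid.
    Note that t1 > t2 is NOT enforced here, mirroring the thread.
    """
    canopies = []
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(c["center"], p)
            if d < t1:
                # Within T1: the point contributes to this canopy's centroid.
                c["members"].append(p)
            if d < t2:
                # Within T2: the point will not start a canopy of its own.
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return canopies


def centroid(canopy):
    """Coordinate-wise mean of the canopy's members."""
    members = canopy["members"]
    dim = len(members[0])
    return [sum(m[i] for m in members) / len(members) for i in range(dim)]
```

Under these assumptions, choosing T1 < T2 (as with 1.15 and 1.9) leaves the number of canopies governed by T2, while points whose distance falls between T1 and T2 are absorbed without contributing to any centroid.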

RE: T1 and T2 in Canopy

Posted by Jeff Eastman <je...@Narus.com>.

I have no good heuristics for setting t3 and t4, only to suggest that the centroid averages done by the canopy mapper using t1 and t2 would tend to create less-sparse centroid vectors for the reducer step and this might tend to make the points be closer together in that pass. I would suggest t3<t1 and t4<t2 but by how much is anybody's guess.
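
To make the role of the four thresholds concrete, here is a small Python sketch of the two-stage arrangement discussed in this thread: each mapper runs a canopy pass over its split with t1/t2 and emits centroids, and a single reducer runs a second canopy pass over those centroids with t3/t4, which fall back to t1/t2 when not given. All names and the Euclidean distance are assumptions for illustration; this is not the Mahout implementation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centroids(points, t1, t2):
    """One canopy pass; returns the centroid of every canopy created."""
    canopies = []  # each entry: [center, running_sum, count]
    for p in points:
        bound = False
        for c in canopies:
            d = euclidean(c[0], p)
            if d < t1:
                # Accumulate the point into the canopy's running sum.
                c[1] = [s + x for s, x in zip(c[1], p)]
                c[2] += 1
            bound = bound or d < t2
        if not bound:
            canopies.append([list(p), list(p), 1])
    return [[s / n for s in total] for _, total, n in canopies]


def two_stage_canopy(splits, t1, t2, t3=None, t4=None):
    """Mapper pass per split with t1/t2, then one reducer pass with t3/t4."""
    t3 = t1 if t3 is None else t3  # default: reuse the mapper thresholds
    t4 = t2 if t4 is None else t4
    mapper_output = []
    for split in splits:
        mapper_output.extend(canopy_centroids(split, t1, t2))
    # The reducer sees centroids (averages), not the original points.
    return canopy_centroids(mapper_output, t3, t4)
```

Because the reducer clusters averaged centroids rather than raw points, its distance distribution can differ from the mappers', which is why separate, possibly smaller, t3/t4 values may be worth trying.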

-----Original Message-----
From: Konstantin Shmakov [mailto:kshmakov@gmail.com] 
Sent: Thursday, June 09, 2011 2:41 PM
To: user@mahout.apache.org
Subject: Re: T1 and T2 in Canopy

Hello

I am experimenting with canopy clustering from Mahout and found -t3 -t4
parameters for canopy in the latest release:
  --t3 (-t3) t3                              T3 (Reducer T1) threshold value

  --t4 (-t4) t4                              T4 (Reducer T2) threshold value

Thanks for adding them.

Could you clarify what the typical settings for -t3, -t4 would be
compared to -t1, -t2?
a) should one keep t1>=t2 and experiment with t3,t4 to speed up the reducer
phase?
b) what are the relative values of t1,t2,t3,t4?
       t1>t2>t3>t4?
       t3>t4>t1>t2?


Some background:
-- I am using vectors that have cardinality >20k with ~20-50 nonzero
elements, similar to the original posting
-- similarly, the mapping phase goes fast for most t1, t2 parameters, while
the single reducer can take forever for most t1>=t2 combinations, so it is
mostly impossible to run a measurable experiment
-- I also found that t1<t2 can dramatically shorten the reducer time

In this case should one keep t1<t2 and use default t3,t4 or try t1>t2 and
experiment with t3,t4?
What values of t3,t4 should be used compared to t1,t2?

Thanks
Konstantin


On Sat, Mar 12, 2011 at 3:23 PM, Jeff Eastman <je...@narus.com> wrote:

> I've got a patch which adds T3/T4 arguments to Canopy. I will create an
> issue for it and post the patch later today. If this is useful, and not just
> two more knobs to guess the values for, I will commit it and we can take a
> look at MeanShift. I suspect you are correct here too but it needs more
> investigation.
>
> -----Original Message-----
> From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> Sent: Tuesday, March 08, 2011 2:43 PM
> To: user@mahout.apache.org
> Subject: RE: T1 and T2 in Canopy
>
> Such functionality would be appreciated.
>
> I think that similar problems can happen with MeanShift.
> I have made a few attempts to run MeanShift with the same large, sparse
> dataset, and either I get one trivial cluster or the algorithm virtually
> stops. I'll investigate the issue further.
>
> Cheers
>
> On 1 March 2011 17:56 Jeff Eastman <je...@Narus.com> wrote:
>
> > It seems like we need to introduce additional T arguments for the reduce
> > step (T3, T4?) to help control these situations. The default could be to
> > use the T1/T2 values. It's a pretty simple patch: add the new parameters
> > to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided)
> > into clusterDateMR() and put them into conf; it is a little tricky to get
> > the CanopyClusterer initialized right in the reducer (perhaps a new
> > constructor to set t1=t3, t2=t4?); then it should just work. Add a unit
> > test to exercise it and it would be a nice addition to Mahout. I can
> > probably get to it by this weekend if nobody else wants to attempt it.
> >
> > Worth doing?
> >
> > -----Original Message-----
> > From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> > Sent: Tuesday, March 01, 2011 2:22 AM
> > To: user@mahout.apache.org
> > Subject: RE: T1 and T2 in Canopy
> >
> > Thank you Jeff for your advice,
> >
> > I think that the problems I encounter are characteristic of the
> > structure of our dataset. The cardinality of the vectors is 20K, whereas
> > the average number of non-zero coordinates is ~50. I checked with a sample
> > that on average 12% of the distances between the vectors are maximal (i.e.
> > there is no overlap in the non-zero coordinates). Moreover, the same values
> > of T1 and T2 are used in the mappers and in the reducer, which poses
> > another challenge, as the distances among the centroids transferred to the
> > reducer probably have a different distribution than the distances between
> > the raw vectors.
> >
> > The process blows up either at the very beginning (too many centroids are
> > created in the mappers) or after the mappers transfer the centroids to the
> > reducer (as far as I can see, only one reducer is hard-coded and
> > everything has to be processed by one node).
> >
> > Cheers
> > Szymon
> >
> >
> > On 28 February 2011 22:25 Jeff Eastman <je...@Narus.com> wrote:
> >
> > > Canopy can be difficult to control and it appears you may have found a
> > > use case for not enforcing T1>T2 (we don't). It is curious, though, that
> > > the settings you have chosen assign points to canopies (dist<T2) but do
> > > not include all of their weights (T2>dist>T1) in the centroids. What
> > > happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the
> > > rules and give you the same number of clusters, but it would also add the
> > > centers of the outliers (dist>1.15). Is this where your processing time
> > > blows up?
> > >
> > > -----Original Message-----
> > > From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> > > Sent: Monday, February 28, 2011 11:55 AM
> > > To: user@mahout.apache.org
> > > Subject: T1 and T2 in Canopy
> > >
> > > Hello,
> > >
> > > I am working with my colleague Tim on the MAHOUT-588 project
> > > (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the
> > > project is to compare Mahout's clustering algorithms on the Apache mail
> > > archives dataset (6 million emails). I have spent the last few days
> > > trying to find values of T1 and T2 that would give a non-trivial set of
> > > clusters (more than 1 and fewer than the number of vectors) and would
> > > produce the result within, say, 3 hours.
> > >
> > > I would be grateful for your advice, as the only way I could do it was
> > > by breaking the rule from the wiki that T1>T2. The problem is that if T1
> > > is large, then we get many non-empty coordinates in each canopy, and both
> > > memory and CPU demand grow. However, setting a low T1 forces a low T2,
> > > which leads to a large number of canopies, and the same problem with
> > > memory and CPU.
> > >
> > > My understanding of the source code is that T1 and T2 are independent,
> > > so I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies
> > > after 40 minutes.
> > >
> > > Thank you in advance for your suggestions on setting T1 and T2, and on
> > > the importance of the T1>T2 constraint.
> > >
> > > Kind regards
> > > Szymon
> > >
> > > ps.
> > > I described my struggle in detail in
> > > https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf.
> > >
> > >
> >
> >
>
> --
> Szymon Chojnacki
> http://www.ipipan.eu/~sch/
>



-- 
ksh:

Re: T1 and T2 in Canopy

Posted by Konstantin Shmakov <ks...@gmail.com>.

Hello

I am experimenting with canopy clustering from Mahout and found -t3 -t4
parameters for canopy in the latest release:
  --t3 (-t3) t3                              T3 (Reducer T1) threshold value

  --t4 (-t4) t4                              T4 (Reducer T2) threshold value

Thanks for adding them.

Could you clarify what the typical settings for -t3, -t4 would be
compared to -t1, -t2?
a) should one keep t1>=t2 and experiment with t3,t4 to speed up the reducer
phase?
b) what are the relative values of t1,t2,t3,t4?
       t1>t2>t3>t4?
       t3>t4>t1>t2?


Some background:
-- I am using vectors that have cardinality >20k with ~20-50 nonzero
elements, similar to the original posting
-- similarly, the mapping phase goes fast for most t1, t2 parameters, while
the single reducer can take forever for most t1>=t2 combinations, so it is
mostly impossible to run a measurable experiment
-- I also found that t1<t2 can dramatically shorten the reducer time

In this case should one keep t1<t2 and use default t3,t4 or try t1>t2 and
experiment with t3,t4?
What values of t3,t4 should be used compared to t1,t2?

Thanks
Konstantin





-- 
ksh:

RE: T1 and T2 in Canopy

Posted by Jeff Eastman <je...@Narus.com>.

I've got a patch which adds T3/T4 arguments to Canopy. I will create an issue for it and post the patch later today. If this is useful, and not just two more knobs to guess the values for, I will commit it and we can take a look at MeanShift. I suspect you are correct here too but it needs more investigation.


Re: T1 and T2 in Canopy

Posted by Lance Norskog <go...@gmail.com>.

High-dimensional vectors don't work as well as 2D vectors with
Manhattan or Euclidean distance.

Minkowski distance is a real-valued variant where Minkowski (1.0) is
Manhattan and Minkowski(2.0) is Euclidean. You can try
MinkowskiDistanceMeasure(0.00001) or MinkowskiDistanceMeasure(10000)
and see if these are more interesting. There are a few more distance
algorithms.
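
As a rough illustration of the Minkowski family mentioned above, here is a small Python sketch. The function name and the scaling trick are my own assumptions and this is not Mahout's MinkowskiDistanceMeasure; it only shows that p=1 gives Manhattan distance, p=2 gives Euclidean distance, and very large p approaches the largest coordinate difference.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences.

    The differences are scaled by the largest one before exponentiation so
    that large p does not overflow; the result is mathematically the same as
    (sum |x - y|^p)^(1/p).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    largest = max(diffs, default=0.0)
    if largest == 0:
        return 0.0  # identical vectors (or empty input)
    scaled_sum = sum((d / largest) ** p for d in diffs)
    return largest * scaled_sum ** (1.0 / p)
```

For example, between (0, 0) and (3, 4) this gives 7 for p=1, 5 for p=2, and a value very close to 4 for p=10000. Very small p (such as 0.00001) makes the outer exponent 1/p huge, so this naive form can overflow there and would need a log-domain computation instead.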

I would experiment on small datasets and do stats on various distances
etc. Pairs of vectors that you can understand (with term strings)
matched with distances could be a real eye-opener.

Lance




-- 
Lance Norskog
goksron@gmail.com

RE: T1 and T2 in Canopy

Posted by Szymon Chojnacki <sa...@o2.pl>.

Such functionality would be appreciated.

I think that similar problems can happen with MeanShift.
I have made a few attempts to run MeanShift with the same large, sparse dataset, and either I get one trivial cluster or the algorithm virtually stops. I'll investigate the issue further.

Cheers


-- 
Szymon Chojnacki
http://www.ipipan.eu/~sch/

RE: T1 and T2 in Canopy

Posted by Jeff Eastman <je...@Narus.com>.

It seems like we need to introduce additional T arguments for the reduce step (T3, T4?) to help control these situations. The default could be to use the T1/T2 values. It's a pretty simple patch: add the new parameters to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into clusterDateMR() and put into conf; a little tricky to get the CanopyClusterer initialized right in the reducer (perhaps a new constructor to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise it and it would be a nice addition to Mahout. I can probably get to it by this weekend if nobody else wants to attempt it. 

Worth doing?
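
The "default to T1/T2" behaviour proposed above can be sketched at the command-line level as follows. This is a hypothetical Python illustration using argparse, not the CanopyDriver code; the option names follow the t1..t4 naming used in this thread and the function name is invented.

```python
import argparse


def parse_thresholds(argv):
    """Parse t1..t4; t3 and t4 fall back to t1 and t2 when omitted."""
    parser = argparse.ArgumentParser(description="canopy thresholds (sketch)")
    parser.add_argument("--t1", type=float, required=True,
                        help="T1 threshold used in the mappers")
    parser.add_argument("--t2", type=float, required=True,
                        help="T2 threshold used in the mappers")
    parser.add_argument("--t3", type=float, default=None,
                        help="reducer T1 (defaults to --t1)")
    parser.add_argument("--t4", type=float, default=None,
                        help="reducer T2 (defaults to --t2)")
    args = parser.parse_args(argv)
    t3 = args.t1 if args.t3 is None else args.t3
    t4 = args.t2 if args.t4 is None else args.t4
    return args.t1, args.t2, t3, t4
```

Keeping the fallback in one place means existing invocations that pass only t1 and t2 behave as before, while the reducer thresholds can be tuned independently when needed.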


RE: T1 and T2 in Canopy

Posted by Szymon Chojnacki <sa...@o2.pl>.

Thank you Jeff for your advice,

I think that the problems I encounter are characteristic of the structure of our dataset. The cardinality of the vectors is 20K, whereas the average number of non-zero coordinates is ~50. I checked on a sample that on average 12% of the distances between the vectors are maximal (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in the mappers and in the reducer, which poses another challenge, as the distances among the centroids transferred to the reducer probably have a different distribution than the distances between the raw vectors.
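
To illustrate the remark about non-overlapping vectors: the thread does not say which distance measure was used, so the sketch below assumes cosine distance on non-negative term weights, and the names `cosine_distance`, `a`, `b` and `c` are made up for the example. Under that assumption, two sparse vectors that share no non-zero coordinate always sit at the maximum distance of 1.0, whatever their weights are.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical vectors in a 20K-dimensional space (indices 0..19999).
a = {3: 1.0, 17: 2.0}
b = {5: 4.0, 19999: 1.0}   # shares no index with a
c = {3: 2.0, 17: 4.0}      # same direction as a

print(cosine_distance(a, b))  # dot product is zero, so the distance is 1.0
print(cosine_distance(a, c))  # parallel vectors, so the distance is close to 0
```

With a measure like this, any threshold at or below the maximum distance can never bind such a pair together, which may be part of why the number of canopies is so sensitive to T2 on very sparse data.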

The process blows up either at the very beginning (too many centroids are created in the mappers) or after the mappers transfer the centroids to the reducer (as far as I can see, only one reducer is hard-coded, so everything has to be processed by one node).

Cheers
Szymon 


-- 
Szymon Chojnacki
http://www.ipipan.eu/~sch/

RE: T1 and T2 in Canopy

Posted by Jeff Eastman <je...@Narus.com>.

Canopy can be difficult to control, and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the weights of the outliers (dist>1.15) to the centroids. Is this where your processing time blows up?
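
The roles of the two thresholds described above can be sketched roughly as follows. This is a simplified, single-pass illustration written for this thread, not Mahout's actual code: the function names `canopy` and `centroid`, the 1-D points and the absolute-difference distance are all assumptions chosen to keep the example small. A point within T1 of a canopy center adds its weight to that canopy's centroid; a point within T2 of any center is not allowed to start a new canopy.

```python
def canopy(points, t1, t2, distance):
    """One sequential canopy pass; returns a list of [center, members] pairs."""
    canopies = []  # each entry: [center, points whose weight counts in the centroid]
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(c[0], p)
            if d < t1:
                c[1].append(p)         # p contributes to this canopy's centroid
            if d < t2:
                strongly_bound = True  # p may not seed a new canopy
        if not strongly_bound:
            canopies.append([p, [p]])  # p becomes the center of a new canopy
    return canopies

def centroid(members):
    """Mean of the 1-D points that contributed to a canopy."""
    return sum(members) / len(members)

points = [0.0, 1.5, 5.0, 6.5]          # hypothetical 1-D data
dist = lambda a, b: abs(a - b)

# T1 < T2, as in the settings discussed: 1.5 and 6.5 are bound but not averaged in.
print([centroid(m) for _, m in canopy(points, 1.15, 1.9, dist)])        # [0.0, 5.0]

# T1 = T2 + epsilon: same number of canopies, but the bound points now shift the centroids.
print([centroid(m) for _, m in canopy(points, 1.9 + 1e-6, 1.9, dist)])  # [0.75, 5.75]
```

In this toy run both settings give two canopies; only the centroids differ, because the points at distance 1.5 from a center are counted in the second case and ignored in the first.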

-----Original Message-----
From: Szymon Chojnacki [mailto:sajmmon@o2.pl] 
Sent: Monday, February 28, 2011 11:55 AM
To: user@mahout.apache.org
Subject: T1 and T2 in Canopy

Hello,

I am working with my colleague Tim within a Mahout-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache-Mail-Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that would give a non-trivial set of clusters (>1 and < # of all vectors) and would output the result within e.g. up to 3h.

I would be grateful for your advice, as the only way I could do it was by breaking the rule from the wiki that (T1>T2). The problem is that if T1 is large then we get many non-empty coordinates in each canopy, and both memory and CPU demand grow. However, setting a low T1 results in a low T2, which leads to a large number of canopies, and the same problem with memory and CPU.

My understanding of the source code is that T1 and T2 are independent. So I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after 40 mins.

Thank you in advance for your suggestions on setting T1 and T2, and on the importance of the T1>T2 constraint.

Kind regards 
Szymon 

ps.
I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf. 

-- 
Szymon Chojnacki
http://www.ipipan.eu/~sch/