Posted to user@mahout.apache.org by Konstantin Shmakov <ks...@gmail.com> on 2011/06/09 23:41:09 UTC

Re: T1 and T2 in Canopy

Hello

I am experimenting with canopy clustering from Mahout and found -t3 -t4
parameters for canopy in the latest release:
  --t3 (-t3) t3                              T3 (Reducer T1) threshold value

  --t4 (-t4) t4                              T4 (Reducer T2) threshold value

Thanks for adding them.

Could you clarify what would be the typical settings for -t3, -t4
compared to -t1, -t2?
a) should one keep t1>=t2 and experiment with t3,t4 to speed up the reducer
phase?
b) what are the relative values of t1,t2,t3,t4
       t1>t2>t3>t4?
       t3>t4>t1>t2?


Some background:
-- I am using vectors that have cardinality >20k with number of nonzero
elements ~20-50 - similar to the original posting
-- similarly, the mapping phase goes fast for most t1, t2 parameters, while the
single reducer can take forever for most t1>=t2 combinations - mostly it is
impossible to run a measurable experiment
-- I also found that t1<t2 can dramatically shorten the reducer time

In this case should one keep t1<t2 and use default t3,t4 or try t1>t2 and
experiment with t3,t4?
What values of t3,t4 should be used compared to t1,t2?

Thanks
Konstantin


On Sat, Mar 12, 2011 at 3:23 PM, Jeff Eastman <je...@narus.com> wrote:

> I've got a patch which adds T3/T4 arguments to Canopy. I will create an
> issue for it and post the patch later today. If this is useful, and not just
> two more knobs to guess the values for, I will commit it and we can take a
> look at MeanShift. I suspect you are correct here too but it needs more
> investigation.
>
> -----Original Message-----
> From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> Sent: Tuesday, March 08, 2011 2:43 PM
> To: user@mahout.apache.org
> Subject: RE: T1 and T2 in Canopy
>
> Such functionality would be appreciated,
>
> I think that similar problems can happen with MeanShift,
> I have made a few attempts to run MeanShift with the same large, sparse
> dataset and either I get one trivial cluster or the algorithm virtually
> stops. I'll investigate the issue further
>
> Cheers
>
> On 1 March 2011 17:56 Jeff Eastman <je...@Narus.com> wrote:
>
> > It seems like we need to introduce additional T arguments for the reduce
> step (T3, T4?) to help control these situations. The default could be to use
> the T1/T2 values. It's a pretty simple patch: add the new parameters to
> CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into
> clusterDateMR() and put into conf; a little tricky to get the
> CanopyClusterer initialized right in the reducer (perhaps a new constructor
> to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise
> it and it would be a nice addition to Mahout. I can probably get to it by
> this weekend if nobody else wants to attempt it.
> >
> > Worth doing?
> >
> > -----Original Message-----
> > From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> > Sent: Tuesday, March 01, 2011 2:22 AM
> > To: user@mahout.apache.org
> > Subject: RE: T1 and T2 in Canopy
> >
> > Thank you Jeff for your advice,
> >
> > I think that the problems I encounter are characteristic for the
> structure of our dataset. The cardinality of the vectors is 20K, whereas an
> average number of non-zero coordinates is ~50. I checked with a sample that
> on average 12% of the distances between the vectors are maximum (i.e. there
> is no overlap in the non-zero coordinates). Moreover, the same values of T1
> and T2 are used in mappers and in a reducer. Which imposes another challenge
> as the distances among the centroids transferred to the reducer probably
> have different distribution than the distances between pure vectors.
> >
> > The process blows up either at the very beginning (too many centroids are
> created in mappers) or after the mappers transfer the centroids to the
> reducer (as I see there is only one reducer hard-coded and everything has to
> be processed by one node)
> >
> > Cheers
> > Szymon
> >
> >
> > On 28 February 2011 22:25 Jeff Eastman <je...@Narus.com> wrote:
> >
> > > Canopy can be difficult to control and it appears you may have found a
> use case for not enforcing T1>T2 (we don't). It is curious, though, that the
> settings you have chosen assign points to canopies (dist<T2) but do not
> include all of their weights (T2>dist>T1) in the centroids. What happens if
> you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give
> you the same number of clusters, but it would also add the centers of the
> outliers (dist>1.15). Is this where your processing time blows up?
> > >
> > > -----Original Message-----
> > > From: Szymon Chojnacki [mailto:sajmmon@o2.pl]
> > > Sent: Monday, February 28, 2011 11:55 AM
> > > To: user@mahout.apache.org
> > > Subject: T1 and T2 in Canopy
> > >
> > > Hello,
> > >
> > > I am working with my colleague Tim within a Mahout-588 project (
> https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project
> is to compare mahout's clustering algorithms with Apache-Mail-Archives
> dataset (6 million emails). I have spent the last few days trying to find
> values of T1 and T2 that would give a non-trivial set of clusters (>1 and
> < # of all vectors) and would output the result within, e.g., 3h.
> > >
> > > I would be grateful for your advice, as the only way I could do it was by
> breaking the rule from the wiki that (T1>T2). The problem is that if T1 is
> large then we get many non-empty coordinates in each canopy, and both memory
> and CPU demand grow. However, setting a low T1 results in a low T2, which leads
> to a large number of canopies, and the same problem with memory and CPU.
> > >
> > > My understanding of the source code is that T1 and T2 are independent.
> So I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after
> 40 mins.
> > >
> > > Thank you in advance for your suggestions on setting T1 and T2, and the
> importance of the T1>T2 constraint.
> > >
> > > Kind regards
> > > Szymon
> > >
> > > ps.
> > > I described my struggle in detail in
> https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf
> .
> > >
> > >
> >
> >
>
> --
> Szymon Chojnacki
> http://www.ipipan.eu/~sch/
>



-- 
ksh:

RE: T1 and T2 in Canopy

Posted by Jeff Eastman <je...@Narus.com>.

I have no good heuristics for setting t3 and t4, only to suggest that the centroid averages done by the canopy mapper using t1 and t2 would tend to create less-sparse centroid vectors for the reducer step and this might tend to make the points be closer together in that pass. I would suggest t3<t1 and t4<t2 but by how much is anybody's guess.
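
As a rough illustration of the two passes discussed in this thread, here is a minimal Python sketch. It is not Mahout's code: the function names (canopy_pass, two_stage_canopy), the dict-based sparse vectors and the choice of cosine distance are all assumptions made for the example. It only shows the idea that the mappers build canopies with t1/t2, and a single reducer then clusters the mapper centroids with t3/t4, which default to t1/t2 when not given.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between sparse vectors stored as {index: value} dicts.

    Assumes neither vector is all zeros.
    """
    dot = sum(v * b[i] for i, v in a.items() if i in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def centroid(members):
    """Component-wise mean of a list of sparse vectors."""
    total = {}
    for m in members:
        for i, v in m.items():
            total[i] = total.get(i, 0.0) + v
    return {i: v / len(members) for i, v in total.items()}


def canopy_pass(points, t1, t2, distance=cosine_distance):
    """One canopy pass: returns the centroid of each canopy.

    dist < t1: the point contributes to that canopy's centroid.
    dist < t2: the point is strongly bound and cannot seed a new canopy.
    Nothing here enforces t1 > t2.
    """
    canopies = []  # each canopy: {"center": vector, "members": [vectors]}
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(c["center"], p)
            if d < t1:
                c["members"].append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return [centroid(c["members"]) for c in canopies]


def two_stage_canopy(splits, t1, t2, t3=None, t4=None):
    """Mapper pass per input split with t1/t2, then one reducer pass with t3/t4."""
    t3 = t1 if t3 is None else t3  # hypothetical defaulting, as proposed in the thread
    t4 = t2 if t4 is None else t4
    mapper_centroids = [c for split in splits for c in canopy_pass(split, t1, t2)]
    return canopy_pass(mapper_centroids, t3, t4)


if __name__ == "__main__":
    splits = [
        [{0: 1.0}, {0: 1.0, 1: 0.1}],
        [{5: 1.0}, {5: 2.0}],
    ]
    print(len(two_stage_canopy(splits, t1=0.5, t2=0.3)))
    print(len(two_stage_canopy(splits, t1=0.5, t2=0.3, t3=1.5, t4=1.2)))
```

Because the mapper centroids are averages, they are usually less sparse than the raw vectors, so the distances seen by the reducer can follow a different distribution. That is why separate reducer thresholds may be worth experimenting with.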
