Posted to user@mahout.apache.org by gabeweb <ga...@htc.com> on 2011/02/10 10:42:44 UTC

Problem in distributed canopy clustering

Hi, I think there is a significant problem in the distributed canopy
clusterer.  I've been comparing the in-memory version to the distributed
version (clustering users in the GroupLens database), and they behave
completely differently.  Firstly, different T1/T2 parameters are required to
get the same number of clusters -- even when the data and similarity metric
are exactly the same.  Secondly, even when I have tuned the parameters to
get the same number of clusters, the distribution of cluster sizes is very
different -- in particular, using e.g. Tanimoto distance, if there are N
clusters, the distributed version tends to create N-1 singleton clusters
and put all the remaining vectors into one large cluster.

I have traced this to the fact that given a single similarity metric,
distances between sparse vectors tend to have a different range than
distances between dense vectors.  It first clusters (sparse) original
vectors in each mapper, and then it takes the (dense) centroid vectors
output by each mapper and applies the same canopy clustering using the same
T1/T2 parameters.  I confirmed this by using a single mapper and simply
turning off the clustering of the reducing step (i.e., have the reducer just
output the same centroids that are input to it); in this case, the
clustering is fine -- somewhat obviously, perhaps, because this makes the
distributed algorithm behave exactly like the in-memory version. 
Specifically, with Tanimoto distance and the reducer effectively turned off,
the average distance between original vectors is 0.984, and with T1 = T2 =
0.983 on 10% of the GroupLens data, I get 24 clusters.  Then if I turn on
the reducer, I only get one cluster, because the average distance between
the dense centroids output by the mapper drops to 0.235, and so every
centroid is now within T1 of every other centroid.  If I want a similar
number of clusters from the unmodified distributed version, I have to
decrease T1/T2 to 0.939, which gives 23 clusters, but much less evenly
distributed (the largest cluster now contains 6779 vectors, 97% of the
input, as opposed to 2684 in the in-memory/turned-off-reducer version),
apparently because the mapper now generates many more clusters (257), a
side effect of T1/T2 being tuned to the distance range of the reducer stage
rather than to that of the mapper stage.
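
To make the effect concrete, here is a minimal standalone sketch (plain
Java, no Mahout dependencies; the sizes are invented for illustration and
are not the GroupLens data). It computes average Tanimoto distances between
sparse binary vectors and between the dense centroids obtained by averaging
batches of them; the centroid-to-centroid distances come out markedly
smaller, the same shift as the 0.984 vs. 0.235 numbers above.

import java.util.Random;

// Minimal standalone sketch (plain Java, no Mahout dependencies) of the
// effect described above. All sizes are invented for illustration.
public class TanimotoRangeDemo {

  // Tanimoto distance: 1 - dot / (|a|^2 + |b|^2 - dot)
  static double tanimoto(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (na + nb - dot);
  }

  static double avgPairwiseDistance(double[][] vs) {
    double sum = 0;
    int pairs = 0;
    for (int i = 0; i < vs.length; i++) {
      for (int j = i + 1; j < vs.length; j++) {
        sum += tanimoto(vs[i], vs[j]);
        pairs++;
      }
    }
    return sum / pairs;
  }

  public static void main(String[] args) {
    int dim = 1000, users = 200, nonZerosPerUser = 20, batch = 50;
    Random rnd = new Random(42);

    // Sparse binary "user" vectors: ~20 non-zeros out of 1000 dimensions.
    double[][] sparse = new double[users][dim];
    for (double[] u : sparse) {
      for (int k = 0; k < nonZerosPerUser; k++) {
        u[rnd.nextInt(dim)] = 1.0;
      }
    }

    // Dense "centroids": averages of batches of 50 sparse vectors, as a
    // canopy mapper would emit.
    double[][] centroids = new double[users / batch][dim];
    for (int c = 0; c < centroids.length; c++) {
      for (int j = 0; j < batch; j++) {
        for (int i = 0; i < dim; i++) {
          centroids[c][i] += sparse[c * batch + j][i] / batch;
        }
      }
    }

    // Typical output: ~0.99 for sparse pairs vs. ~0.67 for centroid pairs.
    System.out.printf("avg sparse-sparse distance:     %.3f%n",
        avgPairwiseDistance(sparse));
    System.out.printf("avg centroid-centroid distance: %.3f%n",
        avgPairwiseDistance(centroids));
  }
}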

Is this a known shortcoming of distributed canopy?  Or am I missing
something?  It seems to me that for this to work, different T1/T2 parameters
would be needed for the mapper and reducer steps.  That would be easy to
program, but it would make tuning the parameters a lot harder -- unless
there were some clever way to automatically adjust the parameters based on
how sparse the vectors being clustered were.
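
One hypothetical shape such an automatic adjustment could take (a sketch
only, in plain Java, not Mahout API; the quantile choices are arbitrary) is
to derive the reducer-stage T1/T2 from the empirical distribution of
pairwise distances among the mapper-output centroids:

import java.util.Arrays;

// Hypothetical sketch of the auto-adjustment idea (not Mahout API):
// instead of reusing the mapper-stage T1/T2, derive reducer-stage
// thresholds from the empirical distribution of pairwise distances among
// the mapper-output centroids. The quantile choices are arbitrary.
public class ThresholdRescaler {

  interface DistanceFn {
    double distance(double[] a, double[] b);
  }

  // Returns {t1, t2} as quantiles of the observed centroid distances.
  static double[] reducerThresholds(double[][] centroids, DistanceFn measure,
                                    double t1Quantile, double t2Quantile) {
    int n = centroids.length;
    double[] dists = new double[n * (n - 1) / 2];
    int k = 0;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        dists[k++] = measure.distance(centroids[i], centroids[j]);
      }
    }
    Arrays.sort(dists);
    return new double[] {
        dists[(int) (t1Quantile * (dists.length - 1))],
        dists[(int) (t2Quantile * (dists.length - 1))]
    };
  }

  public static void main(String[] args) {
    // Toy 1-D centroids; in practice these would be the mapper outputs.
    double[][] centroids = {{0.0}, {1.0}, {4.0}, {5.0}};
    double[] t = reducerThresholds(centroids,
        (a, b) -> Math.abs(a[0] - b[0]), 0.8, 0.4);
    System.out.printf("reducer T1=%.2f T2=%.2f%n", t[0], t[1]);  // 4.00 / 3.00
  }
}

Since the reduce step already collects all mapper centroids on one node,
computing their pairwise distances adds no new scalability constraint; the
open question is which quantiles correspond to what T1/T2 meant at the
mapper stage.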

Thanks.

Re: Problem in distributed canopy clustering

Posted by Ted Dunning <te...@gmail.com>.
Thanks for a careful analysis and well-written comment.


Re: Problem in distributed canopy clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Jeff,

Unfortunately no ideas from my side. But as you said, Canopy is a "fast,
single-pass approximate clustering algorithm", i.e. it is used as a
pre-clustering step, so this should be OK.

Regards, Vasil


RE: Problem in distributed canopy clustering

Posted by Jeff Eastman <je...@Narus.com>.
Hi Vasil,

Your analysis is correct and illustrates the differences between the sequential and parallel versions of Canopy. The mapper clustering output does depend upon which data points are presented to which mapper, and the extra level of processing done in the reducer will also modify the outcomes. Canopy is intended to be a fast, single-pass approximate clustering algorithm, and the Mahout implementation was derived from a Google map/reduce training example. If you have some ideas on how to improve it, we are certainly willing to consider them.

Jeff


Re: Problem in distributed canopy clustering

Posted by Vasil Vasilev <va...@gmail.com>.
I meant T1=4.1 and T2=3.1


Re: Problem in distributed canopy clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Sorry, I made a mistake in my point. The problem will occur in case x1 and
x3 are within T1 range and x2 and x4 are within T1 range, but x1 and x4 are
outside T1 range.
Say all vectors are 1-dimensional and x1=1, x2=2, x3=5 and x4=6. Also T1=3.1
and T2=4.1. Then the sequential variant, which iterates over the points in
the order x1, x2, x3, x4, will produce 2 clusters: cluster1 = (1+2+5)/3 =
2.67 and cluster2 = (5+6)/2 = 5.5.
In the map/reduce variant, mapper 1, which takes x1 and x3, will produce 2
clusters: cluster1m1 = (1+5)/2 = 3 and cluster2m1 = 5.
Mapper 2, which takes x2 and x4, will produce 2 clusters: cluster1m2 =
(2+6)/2 = 4 and cluster2m2 = 6.
The reducer will then produce only 1 cluster: (3+5+4+6)/4 = 4.5.

If I have made a mistake somewhere in my calculations or omitted something,
please ignore my comment.
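
A minimal self-contained sketch (plain Java) that reproduces these numbers,
using the corrected thresholds T1=4.1 and T2=3.1 from the follow-up above:
a point joins every canopy whose originating center is within T1 and starts
a new canopy unless some center is within T2, and each canopy's output
centroid is the mean of the points it collected.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch reproducing the 1-D example above with T1=4.1, T2=3.1.
public class CanopyExample {

  static List<Double> canopy(double[] points, double t1, double t2) {
    List<Double> centers = new ArrayList<>();        // originating points
    List<List<Double>> members = new ArrayList<>();  // collected points
    for (double p : points) {
      boolean stronglyBound = false;
      for (int c = 0; c < centers.size(); c++) {
        double d = Math.abs(p - centers.get(c));
        if (d < t1) {
          members.get(c).add(p);   // p joins every canopy within T1
        }
        if (d < t2) {
          stronglyBound = true;    // p is too close to start a new canopy
        }
      }
      if (!stronglyBound) {        // p starts a new canopy, as first member
        centers.add(p);
        List<Double> m = new ArrayList<>();
        m.add(p);
        members.add(m);
      }
    }
    List<Double> centroids = new ArrayList<>();
    for (List<Double> m : members) {
      double sum = 0;
      for (double v : m) {
        sum += v;
      }
      centroids.add(sum / m.size());
    }
    return centroids;
  }

  public static void main(String[] args) {
    double t1 = 4.1, t2 = 3.1;

    // Sequential variant over all four points: [2.667, 5.5]
    System.out.println(canopy(new double[] {1, 2, 5, 6}, t1, t2));

    // Map/reduce variant: mapper 1 sees {1, 5}, mapper 2 sees {2, 6}.
    List<Double> m1 = canopy(new double[] {1, 5}, t1, t2);  // [3.0, 5.0]
    List<Double> m2 = canopy(new double[] {2, 6}, t1, t2);  // [4.0, 6.0]
    double[] merged = {m1.get(0), m1.get(1), m2.get(0), m2.get(1)};

    // Reducer re-clusters the mapper centroids: one canopy, [4.5]
    System.out.println(canopy(merged, t1, t2));
  }
}

Note that x3=5 is counted in both sequential canopies, since canopies may
overlap within T1; that matches the (1+2+5)/3 and (5+6)/2 arithmetic above.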


Re: Problem in distributed canopy clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hi all,

I also experienced a similar problem when I tried to cluster the synthetic
control data. I have a slightly different version of the data in which each
control chart line is represented by a 3-dimensional vector (dimension 1 -
the trend of the line, dimension 2 - how often it changes direction,
dimension 3 - the maximum shift), so all vectors are dense.

Prompted by this discussion I took a look at the code for the distributed
version, and I noticed that with the current implementation the clustering
of the data will depend heavily on how the data are partitioned among the
mappers. Let me give you an example: say we have 4 points - x1, x2, x3 and
x4. Also, x1 and x2 are very close to each other and x3 and x4 are very
close to each other (within the T2 boundary). Let's also assume that x1 and
x3 are far apart (outside the T1 boundary), and the same is true for the
pairs x1-x4, x2-x3 and x2-x4. Now say that 2 mappers are instantiated to
process the data, and the first mapper takes points x1 and x3 while the
second mapper takes points x2 and x4. The result will be 2 canopies whose
centers are very close to each other. At the reduce step these canopies
will be merged into one canopy. In contrast, the sequential version would
have clustered the same data set into 2 canopies: canopy1 containing x1 and
x2, and canopy2 containing x3 and x4.

Regards, Vasil


RE: Problem in distributed canopy clustering

Posted by Jeff Eastman <je...@Narus.com>.
Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors are denser than the input "(sparse) vectors" were. Your examples clarify this point.


Re: Problem in distributed canopy clustering

Posted by Ted Dunning <te...@gmail.com>.
I don't think that Gabe was saying that the representation of the vectors
affects the arithmetic, only that denser vectors have different statistics
than sparser vectors.  That is not so surprising.  Another way to look at it
is to think of random unit vectors in a 1000-dimensional space with only 1
non-zero component, which has a value of 1.  Almost all pairs of such
vectors will have a zero dot product, which is equivalent to a Euclidean
distance of 1.4.  One out of a thousand pairs will have a distance of zero
(dot product of 1).

On the other hand, if you take the averages of batches of 300 of these
vectors, these averages will be much closer to each other than the
original vectors were.

Taken a third way, if you take unit vectors distributed uniformly on a
sphere, the average distance will again be 1.4, but virtually none of the
vectors will have a distance of zero and many will have distance > 1.4 +
epsilon or < 1.4 - epsilon.

This means that the distances between first level canopies will be very
different from the distances between random vectors.

On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <je...@narus.com> wrote:

> But I don't understand why the DistanceMeasures are returning different
> values for Sparse and Dense vectors.
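
A quick Monte Carlo sketch (plain Java; the dimension and batch size follow
the text above) shows these statistics directly: random one-hot vectors in
1000 dimensions average a pairwise distance of about 1.41, while averages of
batches of 300 of them typically land within about 0.1 of one another.

import java.util.Random;

// Monte Carlo sketch of the distance concentration described above.
public class DistanceConcentration {

  static double dist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return Math.sqrt(s);
  }

  // Average pairwise Euclidean distance over the first k vectors of vs.
  static double avgPairwise(double[][] vs, int k) {
    double sum = 0;
    int pairs = 0;
    for (int i = 0; i < k; i++) {
      for (int j = i + 1; j < k; j++) {
        sum += dist(vs[i], vs[j]);
        pairs++;
      }
    }
    return sum / pairs;
  }

  public static void main(String[] args) {
    int dim = 1000, n = 3000, batch = 300;
    Random rnd = new Random(1);

    // Random one-hot unit vectors in a 1000-dimensional space.
    double[][] onehot = new double[n][dim];
    for (double[] v : onehot) {
      v[rnd.nextInt(dim)] = 1.0;
    }

    // Averages of batches of 300 one-hot vectors (10 "centroids").
    double[][] means = new double[n / batch][dim];
    for (int c = 0; c < means.length; c++) {
      for (int j = 0; j < batch; j++) {
        for (int i = 0; i < dim; i++) {
          means[c][i] += onehot[c * batch + j][i] / batch;
        }
      }
    }

    // Typical output: ~1.413 for one-hot pairs, ~0.08 for the batch means.
    System.out.printf("one-hot avg distance:    %.3f%n",
        avgPairwise(onehot, 500));
    System.out.printf("batch-mean avg distance: %.3f%n",
        avgPairwise(means, means.length));
  }
}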

RE: Problem in distributed canopy clustering

Posted by Jeff Eastman <je...@Narus.com>.
There is a known and documented difference between the sequential and distributed versions of Canopy (see item #7 under "Running Canopy Clustering" at https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering): the distributed version runs two canopy passes whereas the sequential version does only one. But I don't understand why the DistanceMeasures are returning different values for Sparse and Dense vectors. I also don't understand how you are getting DenseVectors for your centroids in the first place; which version are you running? With trunk (and I think also 0.4, since this has not been changed recently) I get RandomAccessSparseVectors from the Mapper, even running Synthetic Control, which starts with DenseVectors.

Jeff
