Posted to user@mahout.apache.org by Shashikant Kore <sh...@gmail.com> on 2009/05/27 15:19:01 UTC

Centroid calculations with sparse vectors

Hi,

To calculate the centroid (say in Canopy clustering) of a set of
sparse vectors, all the non-zero weights are added for each term and
then divided by the cardinality of the vector, which gives the
average weight of a term over all the vectors.

I have sparse vectors with a cardinality of 50,000+, but each vector
has only a couple of hundred terms.  While calculating the centroid,
for each term only a few hundred documents with non-zero term weights
contribute to the total weight, but since that total is divided by
the cardinality (50,000), the final weight is minuscule.  This
results in small documents being marked closer to the centroid
because they have fewer terms in them.  The clusters don't look
"right."

I am wondering if the term weights of the centroid should be
calculated by considering only the non-zero elements.  That is, if a
term occurs in 10 vectors, then the weight of the term in the
centroid is the average of those 10 weight values.  I couldn't locate
any literature that specifically discusses the case of sparse vectors
in centroid calculation.  Any pointers are appreciated.

Thanks,
--shashi

-- 
http://www.bandhan.com/

Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
The problem is that the L^0 norm was being used and it was implemented
assuming that the sparse vector data structure could be queried to find the
number of zeroes.

This is, of course, incorrect.  The data structure can only give an upper
bound on the number of non-zero elements.

The other part of the problem was that L^0 was probably not the right
normalization to use.
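
To see why, here is a toy map-backed example (just an illustration,
not the actual SparseVector internals): an entry that happens to be
stored with the value zero still counts as a stored entry, so the
entry count only bounds the number of non-zeros from above.

 // Toy example: stored entries vs. true non-zero count.
 Map<Integer, Double> entries = new HashMap<Integer, Double>();
 entries.put(3, 2.5);
 entries.put(7, 0.0);   // explicitly stored zero
 entries.put(9, 1.0);

 int storedEntries = entries.size();   // 3 -- only an upper bound on the non-zeros
 int actualNonZeros = 0;
 for (double value : entries.values()) {
   if (value != 0.0) {
     actualNonZeros++;                 // ends up as 2, the true L^0 count
   }
 }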

On Thu, May 28, 2009 at 12:07 AM, Sean Owen <sr...@gmail.com> wrote:

> (Sorry if I am misunderstanding the question or calculation.)
>
> I think your point is, in a sparse vector, shouldn't we ignore the
> 'fake' zeroes we observe where vectors are missing an element?
>

Re: Centroid calculations with sparse vectors

Posted by Sean Owen <sr...@gmail.com>.
Yeah I agree that in many cases "no value" and 0 will be equivalent. I
probably misunderstand the problem domain, but in computing centroids
of points, it does seem that 0 and "dunno" are different. The average
of 0, 1 and 3 is 1.333, while the average of "dunno", 1, and 3 is
"dunno".

Perhaps the question still stands though... should the API not allow
the distinction, even if this problem does or doesn't need it?

On Thu, May 28, 2009 at 8:18 AM, Ted Dunning <te...@gmail.com> wrote:
> You are exactly correct that this is an important distinction.  But that
> wasn't really the problem.  In the problem as stated, the numbers were word
> counts which are quite plausibly zero in the absence of an observation.  For
> rating data, there is a very different situation.
>
> Even with counts where no data == 0, there is some question about the proper
> way to handle the zeros.  My own feeling is that the best interpretation is
> "not observed yet".  If you use a proper probabilistic interpretation, then
> if you have made many observations and still have a zero, then you have a
> constraint while if you have made very few observations and have a zero, it
> means little.  Again, with ratings data, the problem is different.  There it
> is entirely appropriate to treat no data as something that gives us little
> or no information unless the ratings are chosen by the user in which case no
> rating is actually (slightly) informative.
>
> How you combine both forms of data is another question entirely.
>
> On Thu, May 28, 2009 at 12:07 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> So there is no way to distinguish between 0 and "no value" --
>> but conceptually those two are quite different things. Am I right
>> about this? I think the API would have to change then. I tripped on a
>> very similar problem early on in implementing cosine-measure / Pearson
>> correlation, for exactly the same reason. (Or perhaps I am just
>> projecting my past problem/solution here.)
>>
>

Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
You are exactly correct that this is an important distinction.  But that
wasn't really the problem.  In the problem as stated, the numbers were word
counts which are quite plausibly zero in the absence of an observation.  For
rating data, there is a very different situation.

Even with counts where no data == 0, there is some question about the proper
way to handle the zeros.  My own feeling is that the best interpretation is
"not observed yet".  If you use a proper probabilistic interpretation, then
if you have made many observations and still have a zero, then you have a
constraint while if you have made very few observations and have a zero, it
means little.  Again, with ratings data, the problem is different.  There it
is entirely appropriate to treat no data as something that gives us little
or no information unless the ratings are chosen by the user in which case no
rating is actually (slightly) informative.

How you combine both forms of data is another question entirely.

On Thu, May 28, 2009 at 12:07 AM, Sean Owen <sr...@gmail.com> wrote:

> So there is no way to distinguish between 0 and "no value" --
> but conceptually those two are quite different things. Am I right
> about this? I think the API would have to change then. I tripped on a
> very similar problem early on in implementing cosine-measure / Pearson
> correlation, for exactly the same reason. (Or perhaps I am just
> projecting my past problem/solution here.)
>

Re: Centroid calculations with sparse vectors

Posted by Sean Owen <sr...@gmail.com>.
(Sorry if I am misunderstanding the question or calculation.)

I think your point is, in a sparse vector, shouldn't we ignore the
'fake' zeroes we observe where vectors are missing an element?

Yes and no. Here, for any point you consider, you need that point's
position along all dimensions. So I don't think you can say, well,
vector 1 and 2 are at 3 along dimension 1, and vector 3 is at 0, but
I'll pretend it isn't there for my purposes. Why ignore 0? It seems
like a real value!

Or, if you say it's a 'fake' 0, then you cannot compute the centroid's
position in that dimension since you do not have enough information --
you don't know where vector 3 is in that dimension.

If they all had a 0 in the first dimension, you might say, well, none
of the vectors have any information about that dimension, so I'll
ignore it. If that's what you mean, I agree -- again, you have no real
information on which to compute a centroid position in that dimension.


So how do you tell which is a fake 0 and which is a real 0?

I actually think this is a problem with the representation... when you
set a dimension to 0, internally that dimension's value is removed/not
stored. So there is no way to distinguish between 0 and "no value" --
but conceptually those two are quite different things. Am I right
about this? I think the API would have to change then. I tripped on a
very similar problem early on in implementing cosine-measure / Pearson
correlation, for exactly the same reason. (Or perhaps I am just
projecting my past problem/solution here.)


On Thu, May 28, 2009 at 7:52 AM, Shashikant Kore <sh...@gmail.com> wrote:
> Jeff,
>
> Thank you for pointing out the error. Not sure what I was thinking
> when I wrote cardinality as the denominator.
>
> My concern in the following code is that the total is divided by
> numPoints.  For a term, only a few of the numPoints vectors have
> contributed towards the weight; the rest had the value set to zero.
> That drags down the average, and the effect is much more pronounced
> in a large set of sparse vectors.
>
> For example, consider the following doc vectors.
>
> v1: [0:3, 1:6, 2:0, 3:3]
> v2: [0:3, 1:0, 2:0, 3:6]
> v3: [0:0, 1:0, 2:3, 3:0]
>
> The centroid will be:
>
> Centroid: [0:2, 1:2, 2:1, 3:3]
>
> The problem I face with the existing centroid calculation is that,
> out of 100k documents, only a few thousand (or even fewer) contribute
> to the weight of a term. When that total is divided by 100k, the
> weight comes very close to zero.  I am looking for ways to avoid that.
>
> If we consider only non-zero values, the centroid will be:
> Centroid: [0:3, 1:6, 2:3, 3:4.5]
>
> Is this centroid "better" when we are considering a large number of
> sparse vectors?
>
> --shashi
>
> On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman
> <jd...@windwardsolutions.com> wrote:
>> Hi Shashi,
>>
>> I'm not sure I understand your issue. The Canopy centroid calculation
>> divides the individual term totals by the number of points that have been
>> added to the cluster, not by the cardinality of the vector:
>>
>>  public Vector computeCentroid() {
>>   Vector result = new SparseVector(pointTotal.cardinality());
>>   for (int i = 0; i < pointTotal.cardinality(); i++)
>>     result.set(i, pointTotal.get(i) / numPoints);
>>   return result;
>>  }
>>
>> Am I misinterpreting something?
>> Jeff
>>
>> Shashikant Kore wrote:
>>>
>>> Hi,
>>>
>>> To calculate the centroid (say in Canopy clustering) of a set of
>>> sparse vectors, all the non-zero weights are added for each term and
>>> then divided by the cardinality of the vector, which gives the
>>> average weight of a term over all the vectors.
>>>
>>> I have sparse vectors with a cardinality of 50,000+, but each vector
>>> has only a couple of hundred terms.  While calculating the centroid,
>>> for each term only a few hundred documents with non-zero term weights
>>> contribute to the total weight, but since that total is divided by
>>> the cardinality (50,000), the final weight is minuscule.  This
>>> results in small documents being marked closer to the centroid
>>> because they have fewer terms in them.  The clusters don't look
>>> "right."
>>>
>>> I am wondering if the term weights of the centroid should be
>>> calculated by considering only the non-zero elements.  That is, if a
>>> term occurs in 10 vectors, then the weight of the term in the
>>> centroid is the average of those 10 weight values.  I couldn't locate
>>> any literature that specifically discusses the case of sparse vectors
>>> in centroid calculation.  Any pointers are appreciated.
>>>
>>> Thanks,
>>> --shashi
>>>
>>>
>>
>

Re: Centroid calculations with sparse vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Thanks for the clarifications; they make the nomenclature much clearer
and expose its composition. This is a great thread, and I'd like to
incorporate this information in the wiki so it's easier for readers to
find going forward.

Jeff


Ted Dunning wrote:
> Actually, I realize I was being a bit misleading.  There are several
> concepts swirling around here that fall into groups
>
> distance, metric, norm:
>
> A *metric* is some real-valued, non-negative function of pairs of things.  A
> metric should only give 0 for identical points.
>
> A *distance* is a metric that satisfies the triangle inequality.
>
> A *norm* is a thing that assigns lengths to vectors.  If you have
> subtraction, then norm(x-y) is a distance.  Often, the term norm is used
> loosely for functions where f(x-y) only gives a metric rather than a norm.
>
> centroid, medoid, mean, median, average:
>
> A *centroid* is a value z that minimizes sum_i norm(x_i - z) where norm is
> almost always L^2
>
> A *medoid* is a single point x_m from a set of points that minimizes sum_i
> norm(x_i - x_m) for some norm (usually L^2)
>
> A *mean* is a one-dimensional centroid using L^2
>
> A *median* is a one-dimensional centroid using L^1 instead of L^2.  This
> value is not normally unique so there are various methods for tie-breaking.
>
> If you have a continuous space (like we have been talking about), you can
> generalize centroids, but for L^1, the corresponding centroid can be costly
> to compute except in one dimension.
>
> The place that I ran things off the rails was in conflating centroid with
> the norm that defines the centroid.  The L^2 norm is indeed computed as the
> root sum of squares for our purposes.  But the centroid based on this norm
> is computed by averaging all of the points and the original algorithm was
> correct except that numPoints must be the number of points involved in the
> centroid computation rather than the number of non-zero elements in a single
> point.
>
> It would indeed be a good idea to think about alternative clustering.  The
> move from L^2 to L^1 generally is a good thing if you want to decrease the
> sometimes overwhelming impact of outliers, but it can make things much
> harder to compute.  The move from L^2 to L^1 converts k-means to k-medoids
> and I expect that wikipedia's article on same will be pretty good.
>
> There is also a probabilistic interpretation for all of this.  For L^1 and
> L^2, you can interpret minimizing the distance in the centroid computation
> as finding a probability distribution that has maximum likelihood.  For L^2,
> the form of the distribution is the normal distribution: exp(-(x -
> mu)^2/(2s^2)).  For L^1, the distribution is the so-called Laplace
> distribution: exp( -abs(x-mu)/s ).  In general, the Laplace distribution is
> fat-tailed which is what gives it its outlier forgiving nature.
>
> Hope this clarification helps.  Sorry for adding to confusion.
>
> On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>   
>> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
>> search of "L2 norm" which turned up a number of other citations too. Since
>> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
>> DistanceMeasures it might also be appropriate to allow the cluster centroid
>> normalization to be pluggable. I had never considered that. Another mental
>> lightbulb has just turned on. Thanks Ted.
>>
>>     
>
>
>
>   


Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
No, it isn't always a good idea, but it is often a good idea for some
kinds of input.

More specifically, if the input is of the sort that is generated by
something with normally distributed values, then normalizing the way you did
is probably bad and it would be better to standardize the input (adjust to
zero mean, unit variance by translation and scaling) or just leave it alone.
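
By standardizing I mean something along these lines (a plain-array
sketch, not anything in the code base):

 // Standardize values to zero mean and unit variance.
 public static double[] standardize(double[] values) {
   double mean = 0.0;
   for (double v : values) {
     mean += v;
   }
   mean /= values.length;

   double variance = 0.0;
   for (double v : values) {
     variance += (v - mean) * (v - mean);
   }
   variance /= values.length;
   double stdDev = Math.sqrt(variance);

   double[] result = new double[values.length];
   for (int i = 0; i < values.length; i++) {
     result[i] = stdDev == 0.0 ? 0.0 : (values[i] - mean) / stdDev;
   }
   return result;
 }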

If the input doesn't have that sort of error process, then you need to
transform it into something that does.  Count data, for example, doesn't
have the right kind of distribution because a direct L2 or L1 comparison of
different counts mostly just tells you which sample had more trials rather
than what is really different.  Dividing by the sum of the counts (aka L1
normalization) gives you estimates of multinomial probabilities, which do
have a roughly normal distribution, so you will be good there.

Other length dependent data sources might require normalization using L2.

Paradoxically, L2 normalization is very commonly used for term counts from
documents rather than L1.  It isn't clear what will actually be better.
Frankly, I would rather move to a more advanced analysis than worry about
that difference.

On Mon, Jun 1, 2009 at 6:12 AM, Shashikant Kore <sh...@gmail.com> wrote:

> From this issue, it seems the input vectors should be L1/L2
> normalized. Is it a good idea to always normalize the input document
> vectors? If yes, can we make appropriate changes to JIRA 126 (create
> document vectors from text)?
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: Centroid calculations with sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
Hi Ted,

It looks like I misread your original post (which had an error).  I must
confess that I did not get 100% of what you said in the clarification.
Nevertheless, it seemed to help in resolving the problem.

I normalized the input vectors using the L2 norm. (Does that statement
sound right?)  That is, each term weight was divided by the square root
of the sum of squares of the weights.  There is no change in the
centroid calculation; the centroid remains the average weight (sum of
weights / numPoints).  I ran Canopy followed by k-means clustering to
get results.  The results look good now.

The weights in the centroid vary between 1e-3 and 1e-7, so that is as
expected.  Deciding the thresholds for canopy generation looks to be
tricky: T1, T2 values of 1.3 and 0.9 produce 145 canopies, while
changing these values to 1.4 and 1.0 results in a single canopy.

From this issue, it seems the input vectors should be L1/L2
normalized. Is it a good idea to always normalize the input document
vectors? If yes, can we make appropriate changes to JIRA 126 (create
document vectors from text)?

--shashi

On Sat, May 30, 2009 at 1:00 AM, Ted Dunning <te...@gmail.com> wrote:
> On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore <sh...@gmail.com> wrote:
>
>> I tried L1 and L2 norms. The centroid definitely looks better, but the
>> values are still close to zero.
>>
>
> How close is that?  1e-3? (that I would expect) or 1e-300? (that would be
> wrong)
>
>
>
>> Please let me know if my understanding of L1, L2 norms is correct as
>> shown with the code below.
>>
>
> You understood what I said, but I said the wrong thing.  See my (oops)
> posting a few messages back.
>

Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore <sh...@gmail.com> wrote:

> I tried L1 and L2 norms. The centroid definitely looks better, but the
> values are still close to zero.
>

How close is that?  1e-3? (that I would expect) or 1e-300? (that would be
wrong)



> Please let me know if my understanding of L1, L2 norms is correct as
> shown with the code below.
>

You understood what I said, but I said the wrong thing.  See my (oops)
posting a few messages back.


 // L1 Norm ... the centroid is difficult to calculate in a few lines of code

 // what you had for the L1 norm is actually the centroid computation for the L2 norm
 public Vector computeCentroid() {
   Vector result = new SparseVector(pointTotal.cardinality());
   double sum = pointTotal.zSum();
   for (int i = 0; i < pointTotal.cardinality(); i++)
     result.set(i, pointTotal.get(i) / sum);  // dividing each coordinate value by the sum of all weights
   return result;
 }

Re: Centroid calculations with sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
Hi Ted,

I tried L1 and L2 norms. The centroid definitely looks better, but the
values are still close to zero.

Please let me know if my understanding of L1, L2 norms is correct as
shown with the code below.

// L1 Norm
 public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  double sum = pointTotal.zSum();
  for (int i = 0; i < pointTotal.cardinality(); i++)
    result.set(i, pointTotal.get(i) / sum);  // dividing by the sum of all weights
  return result;
 }


// L2 Norm
 public Vector computeCentroid() {
  Vector result = new SparseVector(pointTotal.cardinality());
  double sumSqrt = Math.sqrt(pointTotal.squaredSum());  // assume squaredSum() returns the sum of squares of the term weights
  for (int i = 0; i < pointTotal.cardinality(); i++)
    result.set(i, pointTotal.get(i) / sumSqrt);
  return result;
 }

In the document vector, the weights are TF-IDF values as given by
DefaultSimilarity's tf() and idf().

Thanks,

--shashi

On Thu, May 28, 2009 at 9:14 PM, Ted Dunning <te...@gmail.com> wrote:
> Actually, I realize I was being a bit misleading.  There are several
> concepts swirling around here that fall into groups
>
> distance, metric, norm:
>
> A *metric* is some real-valued, non-negative function of pairs of things.  A
> metric should only give 0 for identical points.
>
> A *distance* is a metric that satisfies the triangle inequality.
>
> A *norm* is a thing that assigns lengths to vectors.  If you have
> subtraction, then norm(x-y) is a distance.  Often, the term norm is used
> loosely for functions where f(x-y) only gives a metric rather than a norm.
>
> centroid, medoid, mean, median, average:
>
> A *centroid* is a value z that minimizes sum_i norm(x_i - z) where norm is
> almost always L^2
>
> A *medoid* is a single point x_m from a set of points that minimizes sum_i
> norm(x_i - x_m) for some norm (usually L^2)
>
> A *mean* is a one-dimensional centroid using L^2
>
> A *median* is a one-dimensional centroid using L^1 instead of L^2.  This
> value is not normally unique so there are various methods for tie-breaking.
>
> If you have a continuous space (like we have been talking about), you can
> generalize centroids, but for L^1, the corresponding centroid can be costly
> to compute except in one dimension.
>
> The place that I ran things off the rails was in conflating centroid with
> the norm that defines the centroid.  The L^2 norm is indeed computed as the
> root sum of squares for our purposes.  But the centroid based on this norm
> is computed by averaging all of the points and the original algorithm was
> correct except that numPoints must be the number of points involved in the
> centroid computation rather than the number of non-zero elements in a single
> point.
>
> It would indeed be a good idea to think about alternative clustering.  The
> move from L^2 to L^1 generally is a good thing if you want to decrease the
> sometimes overwhelming impact of outliers, but it can make things much
> harder to compute.  The move from L^2 to L^1 converts k-means to k-medoids
> and I expect that wikipedia's article on same will be pretty good.
>
> There is also a probabilistic interpretation for all of this.  For L^1 and
> L^2, you can interpret minimizing the distance in the centroid computation
> as finding a probability distribution that has maximum likelihood.  For L^2,
> the form of the distribution is the normal distribution: exp(-(x -
> mu)^2/(2s^2)).  For L^1, the distribution is the so-called Laplace
> distribution: exp( -abs(x-mu)/s ).  In general, the Laplace distribution is
> fat-tailed which is what gives it its outlier forgiving nature.
>
> Hope this clarification helps.  Sorry for adding to confusion.
>
> On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
>> search of "L2 norm" which turned up a number of other citations too. Since
>> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
>> DistanceMeasures it might also be appropriate to allow the cluster centroid
>> normalization to be pluggable. I had never considered that. Another mental
>> lightbulb has just turned on. Thanks Ted.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>



-- 
http://www.bandhan.com/

Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Actually, I realize I was being a bit misleading.  There are several
concepts swirling around here that fall into groups

distance, metric, norm:

A *metric* is some real-valued, non-negative function of pairs of things.  A
metric should only give 0 for identical points.

A *distance* is a metric that satisfies the triangle inequality.

A *norm* is a thing that assigns lengths to vectors.  If you have
subtraction, then norm(x-y) is a distance.  Often, the term norm is used
loosely for functions where f(x-y) only gives a metric rather than a norm.

centroid, medoid, mean, median, average:

A *centroid* is a value z that minimizes sum_i norm(x_i - z) where norm is
almost always L^2

A *medoid* is a single point x_m from a set of points that minimizes sum_i
norm(x_i - x_m) for some norm (usually L^2)

A *mean* is a one-dimensional centroid using L^2

A *median* is a one-dimensional centroid using L^1 instead of L^2.  This
value is not normally unique so there are various methods for tie-breaking.

If you have a continuous space (like we have been talking about), you can
generalize centroids, but for L^1, the corresponding centroid can be costly
to compute except in one dimension.
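
A tiny one-dimensional sketch of the distinction (illustrative code
only, nothing from the code base): the L^2 centroid of a set of values
is their mean, while an L^1 centroid is a median.

 // L^2 centroid in one dimension: the mean, minimizing sum_i (v_i - z)^2.
 public static double l2Centroid(double[] values) {
   double sum = 0.0;
   for (double v : values) {
     sum += v;
   }
   return sum / values.length;
 }

 // L^1 centroid in one dimension: a median, minimizing sum_i |v_i - z|.
 public static double l1Centroid(double[] values) {
   double[] sorted = values.clone();
   java.util.Arrays.sort(sorted);
   return sorted[sorted.length / 2];   // ties between the two middle values broken arbitrarily
 }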

The place that I ran things off the rails was in conflating centroid with
the norm that defines the centroid.  The L^2 norm is indeed computed as the
root sum of squares for our purposes.  But the centroid based on this norm
is computed by averaging all of the points and the original algorithm was
correct except that numPoints must be the number of points involved in the
centroid computation rather than the number of non-zero elements in a single
point.

It would indeed be a good idea to think about alternative clustering.  The
move from L^2 to L^1 generally is a good thing if you want to decrease the
sometimes overwhelming impact of outliers, but it can make things much
harder to compute.  The move from L^2 to L^1 converts k-means to k-medoids
and I expect that wikipedia's article on same will be pretty good.

There is also a probabilistic interpretation for all of this.  For L^1 and
L^2, you can interpret minimizing the distance in the centroid computation
as finding a probability distribution that has maximum likelihood.  For L^2,
the form of the distribution is the normal distribution: exp(-(x -
mu)^2/(2s^2)).  For L^1, the distribution is the so-called Laplace
distribution: exp( -abs(x-mu)/s ).  In general, the Laplace distribution is
fat-tailed which is what gives it its outlier forgiving nature.

Hope this clarification helps.  Sorry for adding to confusion.

On Thu, May 28, 2009 at 5:51 AM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google
> search of "L2 norm" which turned up a number of other citations too. Since
> we already use L1 (Manhattan) and L2 (Euclidean) norms in the respective
> DistanceMeasures it might also be appropriate to allow the cluster centroid
> normalization to be pluggable. I had never considered that. Another mental
> lightbulb has just turned on. Thanks Ted.
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: Centroid calculations with sparse vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I found this (http://en.wikipedia.org/wiki/Lp_space) with a quick Google 
search of "L2 norm" which turned up a number of other citations too. 
Since we already use L1 (Manhattan) and L2 (Euclidean) norms in the 
respective DistanceMeasures it might also be appropriate to allow the 
cluster centroid normalization to be pluggable. I had never considered 
that. Another mental lightbulb has just turned on. Thanks Ted.

Thinking more about DistanceMeasures, is it in general true that 
'distance(X, Y)' is just 'norm(X - Y)', for some choice of norm(x)? If 
so, would it be reasonable to factor the norm(x) out of the 
DistanceMeasures and make that be the pluggable clustering parameter? 
Hum, no, CosineDistanceMeasure normalizes X.dot(Y) by L2(X)*L2(Y), but 
some refactoring might still be possible along these lines. Perhaps a 
smaller number of DistanceMeasures would ensue.

Would users typically want to cluster using the same norm(x) for 
distance and centroid, or should they be independently pluggable?
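
Something like this hypothetical sketch (made-up names, not existing
classes) is the kind of refactoring I'm imagining:

 // Hypothetical sketch of a pluggable norm; not existing Mahout code.
 public interface Norm {
   double norm(Vector x);
 }

 // A distance measure built from a norm, so that distance(x, y) = norm(x - y).
 public class NormDistanceMeasure {
   private final Norm norm;

   public NormDistanceMeasure(Norm norm) {
     this.norm = norm;
   }

   public double distance(Vector x, Vector y) {
     return norm.norm(x.minus(y));   // assumes a Vector.minus() operation
   }
 }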

Jeff


Shashikant Kore wrote:
> Ted,
>
> L^1/L^2 normalization sounds like a good solution. I will try it out
> and report the results.
>
> Is there any literature available comparing these normalization techniques?
>
> Thank you.
>
> --shashi
>
> On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <te...@gmail.com> wrote:
>   
>> Shashi,
>>
>> You are correct that this can be a problem, especially with vectors that
>> have a large number of elements that are zero, but not known to be such.
>>
>> The definition as it stands is roughly an L^0 normalization.  It is more
>> common in clustering to use an L^1 or L^2 normalization.  This would divide
>> the terms by, respectively, the sum of the elements or the square root of
>> the sum of the squares of the elements.  Both L^1 and L^2 normalization
>> avoids the problem you mention since negligibly small elements will not
>> contribute significantly to the norm.
>>
>> Traditionally, L^2 norms are used with documents.  This dates back to Salton
>> and the term-vector model of text retrieval.  That practice was, however,
>> based on somewhat inappropriate geometric intuitions.  Other norms are quite
>> plausibly more appropriate.  For instance, if normalized term frequencies
>> are considered to be estimates of word generation probabilities, then the
>> L^1 norm is much more appropriate.
>>
>> On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <sh...@gmail.com> wrote:
>>
>>     
>>> ...
>>> My concern in the following code is that the total is divided by
>>> numPoints.  For a term, only a few of the numPoints vectors have
>>> contributed towards the weight; the rest had the value set to zero.
>>> That drags down the average, and the effect is much more pronounced
>>> in a large set of sparse vectors.
>>>
>>>
>>>       
>
>
>   


Re: Centroid calculations with sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
Ted,

L^1/L^2 normalization sounds like a good solution. I will try it out
and report the results.

Is there any literature available comparing these normalization techniques?

Thank you.

--shashi

On Thu, May 28, 2009 at 12:30 PM, Ted Dunning <te...@gmail.com> wrote:
> Shashi,
>
> You are correct that this can be a problem, especially with vectors that
> have a large number of elements that are zero, but not known to be such.
>
> The definition as it stands is roughly an L^0 normalization.  It is more
> common in clustering to use an L^1 or L^2 normalization.  This would divide
> the terms by, respectively, the sum of the elements or the square root of
> the sum of the squares of the elements.  Both L^1 and L^2 normalization
> avoid the problem you mention since negligibly small elements will not
> contribute significantly to the norm.
>
> Traditionally, L^2 norms are used with documents.  This dates back to Salton
> and the term-vector model of text retrieval.  That practice was, however,
> based on somewhat inappropriate geometric intuitions.  Other norms are quite
> plausibly more appropriate.  For instance, if normalized term frequencies
> are considered to be estimates of word generation probabilities, then the
> L^1 norm is much more appropriate.
>
> On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <sh...@gmail.com> wrote:
>
>> ...
>> My concern in the following code is that the total is divided by
>> numPoints.  For a term, only a few of the numPoints vectors have
>> contributed towards the weight; the rest had the value set to zero.
>> That drags down the average, and the effect is much more pronounced
>> in a large set of sparse vectors.
>>
>>
>

Re: Centroid calculations with sparse vectors

Posted by Ted Dunning <te...@gmail.com>.
Shashi,

You are correct that this can be a problem, especially with vectors that
have a large number of elements that are zero, but not known to be such.

The definition as it stands is roughly an L^0 normalization.  It is more
common in clustering to use an L^1 or L^2 normalization.  This would divide
the terms by, respectively, the sum of the elements or the square root of
the sum of the squares of the elements.  Both L^1 and L^2 normalization
avoid the problem you mention since negligibly small elements will not
contribute significantly to the norm.

Traditionally, L^2 norms are used with documents.  This dates back to Salton
and the term-vector model of text retrieval.  That practice was, however,
based on somewhat inappropriate geometric intuitions.  Other norms are quite
plausibly more appropriate.  For instance, if normalized term frequencies
are considered to be estimates of word generation probabilities, then the
L^1 norm is much more appropriate.
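
In code, the two normalizations would look roughly like this (a sketch
using the same Vector API as the centroid code in this thread, and
assuming the vector has at least one non-zero element):

 // L^1 normalization: divide each element by the sum of the (absolute) elements.
 public Vector l1Normalize(Vector v) {
   double sum = 0.0;
   for (int i = 0; i < v.cardinality(); i++) {
     sum += Math.abs(v.get(i));
   }
   Vector result = new SparseVector(v.cardinality());
   for (int i = 0; i < v.cardinality(); i++) {
     if (v.get(i) != 0.0) {
       result.set(i, v.get(i) / sum);
     }
   }
   return result;
 }

 // L^2 normalization: divide each element by the square root of the sum of squares.
 public Vector l2Normalize(Vector v) {
   double sumOfSquares = 0.0;
   for (int i = 0; i < v.cardinality(); i++) {
     sumOfSquares += v.get(i) * v.get(i);
   }
   double norm = Math.sqrt(sumOfSquares);
   Vector result = new SparseVector(v.cardinality());
   for (int i = 0; i < v.cardinality(); i++) {
     if (v.get(i) != 0.0) {
       result.set(i, v.get(i) / norm);
     }
   }
   return result;
 }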

On Wed, May 27, 2009 at 11:52 PM, Shashikant Kore <sh...@gmail.com> wrote:

> ...
> My concern in the following code is that the total is divided by
> numPoints.  For a term, only a few of the numPoints vectors have
> contributed towards the weight; the rest had the value set to zero.
> That drags down the average, and the effect is much more pronounced
> in a large set of sparse vectors.
>
>

Re: Centroid calculations with sparse vectors

Posted by Shashikant Kore <sh...@gmail.com>.
Jeff,

Thank you for pointing out the error. Not sure what I was thinking
when I wrote cardinality as the denominator.

My concern in the following code is that the total is divided by
numPoints.  For a term, only a few of the numPoints vectors have
contributed towards the weight; the rest had the value set to zero.
That drags down the average, and the effect is much more pronounced
in a large set of sparse vectors.

For example, consider the following doc vectors.

v1: [0:3, 1:6, 2:0, 3:3]
v2: [0:3, 1:0, 2:0, 3:6]
v3: [0:0, 1:0, 2:3, 3:0]

The centroid will be:

Centroid: [0:2, 1:2, 2:1, 3:3]

The problem I face with the existing centroid calculation is that,
out of 100k documents, only a few thousand (or even fewer) contribute
to the weight of a term. When that total is divided by 100k, the
weight comes very close to zero.  I am looking for ways to avoid that.

If we consider only non-zero values, the centroid will be:
Centroid: [0:3, 1:6, 2:3, 3:4.5]

Is this centroid "better" when we are considering a large number of
sparse vectors?
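
To make the comparison concrete, here is a small throwaway snippet for
the three vectors above (plain arrays, just to show the arithmetic):

 // v1, v2, v3 from the example above, as dense arrays for clarity.
 double[][] points = {
     {3, 6, 0, 3},   // v1
     {3, 0, 0, 6},   // v2
     {0, 0, 3, 0}    // v3
 };

 double[] byNumPoints = new double[4];  // existing centroid: divide by the number of points
 double[] byNonZero = new double[4];    // proposed centroid: divide by the non-zero count per term
 for (int term = 0; term < 4; term++) {
   double total = 0.0;
   int nonZero = 0;
   for (double[] point : points) {
     total += point[term];
     if (point[term] != 0.0) {
       nonZero++;
     }
   }
   byNumPoints[term] = total / points.length;                // [2.0, 2.0, 1.0, 3.0]
   byNonZero[term] = nonZero == 0 ? 0.0 : total / nonZero;   // [3.0, 6.0, 3.0, 4.5]
 }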

--shashi

On Thu, May 28, 2009 at 7:59 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> Hi Shashi,
>
> I'm not sure I understand your issue. The Canopy centroid calculation
> divides the individual term totals by the number of points that have been
> added to the cluster, not by the cardinality of the vector:
>
>  public Vector computeCentroid() {
>   Vector result = new SparseVector(pointTotal.cardinality());
>   for (int i = 0; i < pointTotal.cardinality(); i++)
>     result.set(i, pointTotal.get(i) / numPoints);
>   return result;
>  }
>
> Am I misinterpreting something?
> Jeff
>
> Shashikant Kore wrote:
>>
>> Hi,
>>
>> To calculate the centroid (say in Canopy clustering) of a set of
>> sparse vectors, all the non-zero weights are added for each term and
>> then divided by the cardinality of the vector, which gives the
>> average weight of a term over all the vectors.
>>
>> I have sparse vectors with a cardinality of 50,000+, but each vector
>> has only a couple of hundred terms.  While calculating the centroid,
>> for each term only a few hundred documents with non-zero term weights
>> contribute to the total weight, but since that total is divided by
>> the cardinality (50,000), the final weight is minuscule.  This
>> results in small documents being marked closer to the centroid
>> because they have fewer terms in them.  The clusters don't look
>> "right."
>>
>> I am wondering if the term weights of the centroid should be
>> calculated by considering only the non-zero elements.  That is, if a
>> term occurs in 10 vectors, then the weight of the term in the
>> centroid is the average of those 10 weight values.  I couldn't locate
>> any literature that specifically discusses the case of sparse vectors
>> in centroid calculation.  Any pointers are appreciated.
>>
>> Thanks,
>> --shashi
>>
>>
>

Re: Centroid calculations with sparse vectors

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Shashi,

I'm not sure I understand your issue. The Canopy centroid calculation 
divides the individual term totals by the number of points that have 
been added to the cluster, not by the cardinality of the vector:

  public Vector computeCentroid() {
    Vector result = new SparseVector(pointTotal.cardinality());
    for (int i = 0; i < pointTotal.cardinality(); i++)
      result.set(i, pointTotal.get(i) / numPoints);
    return result;
  }

Am I misinterpreting something?
Jeff

Shashikant Kore wrote:
> Hi,
>
> To calculate the centroid (say in Canopy clustering) of a set of
> sparse vectors, all the non-zero weights are added for each term and
> then divided by the cardinality of the vector, which gives the
> average weight of a term over all the vectors.
>
> I have sparse vectors with a cardinality of 50,000+, but each vector
> has only a couple of hundred terms.  While calculating the centroid,
> for each term only a few hundred documents with non-zero term weights
> contribute to the total weight, but since that total is divided by
> the cardinality (50,000), the final weight is minuscule.  This
> results in small documents being marked closer to the centroid
> because they have fewer terms in them.  The clusters don't look
> "right."
>
> I am wondering if the term weights of the centroid should be
> calculated by considering only the non-zero elements.  That is, if a
> term occurs in 10 vectors, then the weight of the term in the
> centroid is the average of those 10 weight values.  I couldn't locate
> any literature that specifically discusses the case of sparse vectors
> in centroid calculation.  Any pointers are appreciated.
>
> Thanks,
> --shashi
>
>