Posted to user@mahout.apache.org by "Fernando O." <fo...@gmail.com> on 2011/11/22 11:42:54 UTC

Clustering Question (from a newbie)

Hi all,
    Disclaimer: I'm a total newbie in data mining / clustering / AI and
all the areas around them. My knowledge of clustering is basically what I
learned in my regular CS courses; I never did research or work with this before.

Any reading recommendation would be much appreciated :D

I'm trying to understand a large set of data: I have a set of geographical
regions, and for each region I have N characteristics or categories. The
measure that I have is something like an indicator of the importance of
that characteristic in that region.

So I have a table something like this:
       C1     C2     C3
R1    80%    20%     0%
R2    75%    25%     0%
R3    50%    20%    30%

From what I read, k-means works pretty well for most cases, so I chose to
use that clustering technique.
Then I used the Tanimoto distance because I wanted to measure the
correlation between categories.
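
For reference, the usual real-valued Tanimoto distance can be sketched in plain Java as below, applied to the three example rows above. This is only an illustration: the class and method names are made up, and whether Mahout's TanimotoDistanceMeasure computes exactly this expression is an assumption here.

```java
// Plain-Java sketch of the Tanimoto distance for real-valued vectors.
// Class and method names are illustrative, not Mahout's.
final class TanimotoSketch {

    // d(a, b) = 1 - (a . b) / (|a|^2 + |b|^2 - a . b)
    static double distance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        // Two all-zero vectors are treated as identical.
        return denominator == 0.0 ? 0.0 : 1.0 - dot / denominator;
    }

    public static void main(String[] args) {
        // The example rows from the table, stored as fractions.
        double[] r1 = {0.80, 0.20, 0.00};
        double[] r2 = {0.75, 0.25, 0.00};
        double[] r3 = {0.50, 0.20, 0.30};
        System.out.println("d(R1, R2) = " + distance(r1, r2));
        System.out.println("d(R1, R3) = " + distance(r1, r3));
    }
}
```

For these rows the distance between R1 and R2 comes out at roughly 0.008 and the distance between R1 and R3 at roughly 0.29. Note that this compares two regions by the overlap of their category profiles; it does not measure correlation between the categories themselves.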

Right now I have a small set: 148 regions and 13 categories. Of those 148
regions, only one has more than 1% in Cn, and it has in fact 36%.

So I would expect that if I set the number of clusters to something
relatively large (15 or 20), I would get a cluster containing only that
region with Cn=36%.

My problem is that I couldn't make that happen, and I'm not sure why.
In fact, I get some empty clusters.

R1    58,30%   1,10%   0,00%   5,66%   5,55%   2,24%   1,42%   3,20%   1,12%  14,75%   6,23%   0,25%   0,01%   0,16%
R2    37,08%   1,95%   0,00%  26,27%   4,86%   0,11%   0,00%   0,00%   0,76%   7,78%  18,16%   0,00%   0,00%   0,00%
R3    48,86%   3,03%   6,14%   5,98%   7,91%   1,85%   1,69%   3,55%   0,43%  15,63%   4,83%   0,09%   0,00%   0,00%
*R4*   8,86%   0,59%   6,60%   2,46%   2,06%   1,26%   0,26%   1,71%   0,47%   6,11%   7,43%   0,03%  61,96%   0,21%
R5    51,56%   2,55%   0,00%  16,08%   7,29%   0,49%   3,31%   1,22%   0,47%  13,49%   3,53%   0,01%   0,00%   0,00%
R6    40,15%   6,26%   0,00%   8,07%   5,25%   0,20%   0,45%  13,29%   1,28%  12,85%  11,64%   0,00%   0,00%   0,55%

Running k-means like this:

KMeansDriver.run(conf, new Path("mahoutTest/regions"),
    new Path("testdata/clusters"), new Path("output"),
    new TanimotoDistanceMeasure(), 0.001, 1000, true, false);

The vectors for each region are scaled by 1/100 (so the 8,86% above is
stored as 0.0886).

Any idea what I might be doing wrong? (Please don't say everything! :D)

Thanks a lot!

Re: Clustering Question (from a newbie)

Posted by Ted Dunning <te...@gmail.com>.

Make sure that you add \delta once on the top and N times on the bottom of
the expression (i.e. once for each category).

On Wed, Nov 23, 2011 at 12:44 AM, Fernando O. <fo...@gmail.com> wrote:

> I'll look into Kullback-Leibler, and thanks a lot for noticing the \delta.
> I do need it, in fact!
>
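
Spelled out, this is the smoothing formula quoted elsewhere in this thread with the sum in the denominator expanded. N is introduced here for the number of categories; the counts and the value of \delta in the example are made up for illustration.

```latex
p_i = \frac{k_i + \delta}{\sum_{j=1}^{N} (k_j + \delta)}
    = \frac{k_i + \delta}{N\delta + \sum_{j=1}^{N} k_j}

% Example with N = 3, counts k = (8, 2, 0) and \delta = 0.5:
p = \left( \frac{8.5}{11.5},\ \frac{2.5}{11.5},\ \frac{0.5}{11.5} \right)
  \approx (0.739,\ 0.217,\ 0.043)
```

The three estimates still sum to 1, and the category with a zero count gets a small positive probability instead of 0.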

Re: Clustering Question (from a newbie)

Posted by "Fernando O." <fo...@gmail.com>.

Hi Ted!
   Thanks a lot for your answer. At first I used the original counts,
expecting that the resulting clusters would have some logic. Then I realized
that most of the distance measures I was experimenting with do something
like this: for two vectors v and e => some_calculationOn(v_i, e_i).

After looking at my results I went back to think about the problem and I
realized that if I want to look at category weight then I would need to
express the weight of each category in each row.

I'll look into Kullback-Leibler, and thanks a lot for noticing the \delta.
I do need it, in fact!

Cheers,
Fernando

On Tue, Nov 22, 2011 at 9:11 PM, Ted Dunning <te...@gmail.com> wrote:

> I would recommend that you work with the original counts instead of
> percentages.  That allows you to use statistical similarity measures based
> on the multinomial distribution.  The important thing that the counts
> provide over percentages is an understanding of how certain the
> distribution really is.
>
> If you move forward with using the percentages, I would consider using
> something like Kullback-Leibler divergence as a measure of dissimilarity.
> You would need to smooth the probabilities when you derive them from the
> counts.  The simplest method for this is to introduce a simple prior into
> your estimates.  Then, if the count for each category i is k_i, you would
> estimate the percentage p_i as
>
>    p_i = (k_i + \delta) / \sum_j (k_j + \delta)
>
> This prevents you from ever estimating either 0 or 1 for these percentages
> and thus helps avoid log 0.  It also will tend to give you better results
> in a variety of ways.

Re: Clustering Question (from a newbie)

Posted by Ted Dunning <te...@gmail.com>.

I would recommend that you work with the original counts instead of
percentages.  That allows you to use statistical similarity measures based
on the multinomial distribution.  The important thing that the counts
provide over percentages is an understanding of how certain the
distribution really is.
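
One example of a count-based measure is the log-likelihood ratio (G^2) statistic for two rows of counts, sketched below in plain Java. Whether this is the measure meant above is an assumption; the class name, method names and counts are made up for illustration.

```java
// Sketch: log-likelihood ratio (G^2) statistic comparing two rows of counts.
// Names and example counts are illustrative only.
final class CountLlr {

    // G^2 = 2 * sum over all cells of k * ln(k * N / (rowTotal * columnTotal)).
    // Cells with k = 0 contribute nothing to the sum.
    static double gSquared(long[] a, long[] b) {
        long totalA = 0;
        long totalB = 0;
        for (int j = 0; j < a.length; j++) {
            totalA += a[j];
            totalB += b[j];
        }
        double n = totalA + totalB;
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double columnTotal = a[j] + b[j];
            sum += term(a[j], totalA, columnTotal, n)
                 + term(b[j], totalB, columnTotal, n);
        }
        return 2.0 * sum;
    }

    private static double term(long k, double rowTotal, double columnTotal, double n) {
        if (k == 0) {
            return 0.0;
        }
        return k * Math.log(k * n / (rowTotal * columnTotal));
    }

    public static void main(String[] args) {
        // Same percentages (80/20 versus 50/50), different amounts of evidence.
        long[] smallA = {8, 2};
        long[] smallB = {5, 5};
        long[] largeA = {80, 20};
        long[] largeB = {50, 50};
        System.out.println("G^2 with 10 observations per row:  " + gSquared(smallA, smallB));
        System.out.println("G^2 with 100 observations per row: " + gSquared(largeA, largeB));
    }
}
```

Both pairs of rows have identical percentages, yet the statistic is ten times larger for the pair with ten times as many observations (roughly 2 versus roughly 20). That difference is the certainty information that is lost once counts are converted to percentages.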

If you move forward with using the percentages, I would consider using
something like Kullback-Leibler divergence as a measure of dissimilarity.
You would need to smooth the probabilities when you derive them from the
counts.  The simplest method for this is to introduce a simple prior into
your estimates.  Then, if the count for each category i is k_i, you would
estimate the percentage p_i as

    p_i = (k_i + \delta) / \sum_j (k_j + \delta)

This prevents you from ever estimating either 0 or 1 for these percentages
and thus helps avoid log 0.  It also will tend to give you better results
in a variety of ways.
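
A minimal plain-Java sketch of the smoothing formula above, followed by the Kullback-Leibler divergence between two smoothed distributions. The class name, method names, counts and the choice of \delta = 0.5 are assumptions made for illustration.

```java
// Sketch: additive smoothing of category counts, then KL divergence.
// Names, counts and the delta value are illustrative only.
final class SmoothedKl {

    // p_i = (k_i + delta) / sum_j (k_j + delta)
    static double[] smooth(long[] counts, double delta) {
        double total = 0.0;
        for (long k : counts) {
            total += k + delta;
        }
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            p[i] = (counts[i] + delta) / total;
        }
        return p;
    }

    // KL(p || q) = sum_i p_i * ln(p_i / q_i).
    // Assumes every entry is strictly positive, which smoothing with delta > 0 ensures.
    static double klDivergence(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] regionA = {80, 20, 0};   // made-up counts
        long[] regionB = {50, 20, 30};  // made-up counts
        double[] p = smooth(regionA, 0.5);
        double[] q = smooth(regionB, 0.5);
        System.out.println("KL(A || B) = " + klDivergence(p, q));
        System.out.println("KL(B || A) = " + klDivergence(q, p));
    }
}
```

The two printed values differ because the divergence is not symmetric. If a symmetric dissimilarity is needed, one common option is to add the two directions together.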

On Tue, Nov 22, 2011 at 1:46 PM, Fernando O. <fo...@gmail.com> wrote:

> It's 148 now because I'm doing initial tests :D
>
> Yes, values add up to 1.
>
> For this example the percentages are precalculated: basically, I get a
> total number for each category and then convert it to percentages.

Re: Clustering Question (from a newbie)

Posted by "Fernando O." <fo...@gmail.com>.

It's 148 now because I'm doing initial tests :D

Yes, values add up to 1.

For this example the percentages are precalculated: basically, I get a
total number for each category and then convert it to percentages.


On Tue, Nov 22, 2011 at 6:10 PM, Ted Dunning <te...@gmail.com> wrote:

> Do the category values add up to 1 for every row?
>
> Where do these percentages come from?
>
> At 148 rows, I would use R instead of Mahout.

Re: Clustering Question (from a newbie)

Posted by Ted Dunning <te...@gmail.com>.

Do the category values add up to 1 for every row?

Where do these percentages come from?

At 148 rows, I would use R instead of Mahout.

On Tue, Nov 22, 2011 at 2:42 AM, Fernando O. <fo...@gmail.com> wrote:

> So I have a table something like this:
>        C1     C2     C3
> R1    80%    20%     0%
> R2    75%    25%     0%
> R3    50%    20%    30%
>
> From what I read, k-means works pretty well for most cases, so I chose to
> use that clustering technique.
> Then I used the Tanimoto distance because I wanted to measure the
> correlation between categories.
>
> Right now I have a small set: 148 regions and 13 categories. Of those 148
> regions, only one has more than 1% in Cn, and it has in fact 36%.
>

Re: Clustering Question (from a newbie)

Posted by "Fernando O." <fo...@gmail.com>.

In clustersIn I had #Categories clusters, each with an arbitrary vector as
its initial centroid (I was using the first #Categories vectors that I got).

I realized that since I had percentages, I could create arbitrary centroids
with a value of 0.5 on the corresponding category and 0 on the others.

Turns out that it works really well :D I still have to take a better look,
but it looks correct.

Now I'm wondering if there is any paper that supports my assumption.
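
The seeding described above can be sketched in plain Java as follows. Everything here is illustrative: the class and method names are made up, squared Euclidean distance stands in for whichever measure is actually configured, and the 14 columns follow the sample rows from the original post rather than the stated 13 categories.

```java
// Sketch: one seed centroid per category, plus a nearest-seed lookup.
// Names are illustrative; squared Euclidean distance is an assumption.
final class SeedCentroids {

    // Seed c has `weight` on category c and 0 on every other category.
    static double[][] oneHotSeeds(int categories, double weight) {
        double[][] seeds = new double[categories][categories];
        for (int c = 0; c < categories; c++) {
            seeds[c][c] = weight;
        }
        return seeds;
    }

    // Index of the seed closest to the row under squared Euclidean distance.
    static int nearestSeed(double[] row, double[][] seeds) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int s = 0; s < seeds.length; s++) {
            double d = 0.0;
            for (int i = 0; i < row.length; i++) {
                double diff = row[i] - seeds[s][i];
                d += diff * diff;
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] seeds = oneHotSeeds(14, 0.5);
        // Rows R1 and R4 from the original post, stored as fractions.
        double[] r1 = {0.5830, 0.0110, 0.0000, 0.0566, 0.0555, 0.0224, 0.0142,
                       0.0320, 0.0112, 0.1475, 0.0623, 0.0025, 0.0001, 0.0016};
        double[] r4 = {0.0886, 0.0059, 0.0660, 0.0246, 0.0206, 0.0126, 0.0026,
                       0.0171, 0.0047, 0.0611, 0.0743, 0.0003, 0.6196, 0.0021};
        System.out.println("R1 starts in cluster " + nearestSeed(r1, seeds));
        System.out.println("R4 starts in cluster " + nearestSeed(r4, seeds));
    }
}
```

Under squared Euclidean distance, a seed with 0.5 on one category and 0 elsewhere is closest to the rows whose largest value is in that category, so the first assignment step groups regions by their dominant category. A region that is the only one dominated by some category therefore starts out alone in that cluster.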

On Tue, Nov 22, 2011 at 7:50 AM, Paritosh Ranjan <pr...@xebia.com> wrote:

> public static void run(Configuration conf,
>                         Path input,
>                         Path clustersIn,
>                         Path output,...
>
> The third parameter is clustersIn. What are you providing there?
>
> I suggest that you first use canopy clustering to find an appropriate
> number of clusters, and then pass the resulting clusters as clustersIn.
> You might be providing the wrong clustersIn, which can create problems.
>
> Paritosh

Re: Clustering Question (from a newbie)

Posted by Paritosh Ranjan <pr...@xebia.com>.

public static void run(Configuration conf,
                          Path input,
                          Path clustersIn,
                          Path output,...

The third parameter is clustersIn. What are you providing there?

I suggest that you first use canopy clustering to find an appropriate
number of clusters, and then pass the resulting clusters as clustersIn.
You might be providing the wrong clustersIn, which can create problems.

Paritosh
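
The idea behind canopy clustering can be sketched in plain Java as below. This is a simplified in-memory version for illustration, not Mahout's canopy implementation: the class and method names, the Euclidean distance and the thresholds T1 and T2 are all assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified in-memory canopy clustering; not Mahout's implementation.
final class CanopySketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Each returned canopy lists its center first, then the remaining
    // candidates within t1 of it. Candidates within t2 of a center are
    // removed and can no longer start a canopy of their own.
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        if (t1 < t2) {
            throw new IllegalArgumentException("t1 must be at least t2");
        }
        List<double[]> candidates = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!candidates.isEmpty()) {
            double[] center = candidates.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = candidates.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = euclidean(center, p);
                if (d < t1) {
                    canopy.add(p);
                }
                if (d < t2) {
                    it.remove();
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> regions = new ArrayList<>();
        regions.add(new double[] {0.80, 0.20, 0.00});
        regions.add(new double[] {0.75, 0.25, 0.00});
        regions.add(new double[] {0.50, 0.20, 0.30});
        // Thresholds chosen by hand for this tiny example.
        System.out.println("canopies found: " + canopies(regions, 0.3, 0.15).size());
    }
}
```

The number of canopies gives a candidate value for k, and the canopy centers can serve as the initial clusters. The result depends heavily on T1 and T2, so they usually need some experimentation.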

On 22-11-2011 16:12, Fernando O. wrote:
> Running k-means like this:
> KMeansDriver.run(conf, new Path("mahoutTest/regions"),
>     new Path("testdata/clusters"), new Path("output"),
>     new TanimotoDistanceMeasure(), 0.001, 1000, true, false);