Posted to user@mahout.apache.org by "Yuji NISHIDA@U-Tokyo" <ni...@gmail.com> on 2012/08/04 02:25:07 UTC

mahout clusterdump output

Dear all

I am working with Mahout, using canopy and k-means, and ran into a problem
with the clusterdump output.
Each vector is named with a simple number, incremented from 1.
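
For reference, here is a minimal sketch of how such named vectors can be
written for the clustering jobs. It assumes Mahout's NamedVector and
VectorWritable classes and the old Hadoop 0.20 SequenceFile.Writer
constructor; the path, counts, and placeholder features are illustrative
only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteNamedVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("testdata/points/file1");  // illustrative path
        SequenceFile.Writer writer =
            new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
        try {
          for (int i = 1; i <= 5000; i++) {           // names "1".."5000", as in the post
            double[] values = new double[100];        // 100 dimensions, as in the post
            for (int j = 0; j < values.length; j++) {
              values[j] = Math.random();              // placeholder features
            }
            NamedVector vec = new NamedVector(new DenseVector(values), String.valueOf(i));
            writer.append(new Text(vec.getName()), new VectorWritable(vec));
          }
        } finally {
          writer.close();
        }
      }
    }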

When I used 5,000 vectors, I got correct output. It looks like this:

VL-0{n=64,c=[...], r[...]}
    1.0: 1= [...]
    1.0: 3= [...]
    1.0: 4= [...]
     ...
    1.0: 396= [...]    # The number of vectors is exactly the same as n (64).
VL-1{n=5,c=[...], r[...]}
    1.0: 2= [...]
    1.0: 12= [...]
    ...
    1.0: 4221= [...]
VL-2{n=121,c=[...], r[...]}
...

The number of vectors listed under each VL exactly matches its n value.

When I used 600,000 vectors, the output looks wrong:

VL-0{n=14,c=[...], r[...]}
    1.0: 66636= [...]
    1.0: 122570= [...]
    ...
    1.0: 522794= [...]    # The number of vectors is 31.
VL-8{n=0,c=[...], r[...]}
    1.0: 393539= [...]
    1.0: 398877= [...]
    ...
    1.0: 513448= [...]    # The number of vectors is 5.
VL-16{n=2,c=[...], r[...]}
...

It looks as if VL-1 to VL-7 and VL-9 to VL-15 are not used, but I confirmed
that they do exist in the output.
The VL IDs seem to appear in the order 0, 8, 16, ..., 11552, then
1, 9, 17, ..., 11553, then 2, 10, 18, ... and so on.

Can I trust this result, or should I suspect it is caused by a bug?

Hadoop : 0.20.204
Mahout : rev. 1351561, 1366995, 1367871

Best regards.

-- 
nishidy@u-tokyo

Re: mahout clusterdump output

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Mattie is correct on the VL/CL notation. Convergence, however, does not
mean that the cluster centers have stopped moving, only that their
movement is below a certain threshold. Thus it is entirely possible for
a few points observed to be in cluster X in the final iteration to be
classified into cluster Y in the final clustering output, since the
centers of X and Y were adjusted slightly - but by less than the threshold
- after the final iteration ended. Decreasing the threshold will certainly
minimize this phenomenon, and in the limit can prevent it. It will
require more iterations, however, and you need to assess the
cost-benefit of that course of action.
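
To make that concrete, here is an illustrative sketch of the kind of test
involved (plain Java, not Mahout's actual code; the delta value is just an
example):

    // Illustrative sketch only -- not Mahout's actual implementation.
    // "Converged" means the center moved by less than the convergence
    // delta between iterations, not that it stopped moving entirely.
    public class ConvergenceCheck {
      static boolean isConverged(double[] oldCenter, double[] newCenter, double delta) {
        double sumSq = 0.0;
        for (int i = 0; i < oldCenter.length; i++) {
          double d = newCenter[i] - oldCenter[i];
          sumSq += d * d;
        }
        // The center can still move a little (anything < delta), so a point
        // near a cluster boundary may flip clusters in the final -cl pass.
        return Math.sqrt(sumSq) < delta;
      }

      public static void main(String[] args) {
        double[] before = {1.000, 2.000};
        double[] after  = {1.004, 1.997};               // moved, but only slightly
        System.out.println(isConverged(before, after, 0.01));  // true: "converged"
      }
    }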



RE: mahout clusterdump output

Posted by "Whitmore, Mattie" <mw...@harris.com>.
VL- means the cluster has converged, which is good.  CL- means the cluster has not converged -- i.e., I need to run more iterations or adjust my threshold.

I don't use the command-line kmeans; rather, I use the KMeansDriver API, with runClustering set to true.  Is this counting discrepancy just due to the fact that some of my clusters have not converged -- so even though points are observed by a cluster, they are not assigned to that cluster?
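
For reference, a minimal sketch of the driver call being described. The
paths are examples, and the exact run(...) parameter list is an assumption
here (it changed between Mahout versions), so check the Javadoc for your
revision:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;

    public class RunKMeans {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example paths -- substitute your own input vectors and canopy output.
        Path input = new Path("input/vectors");
        Path clustersIn = new Path("output/canopy/clusters-0-final");
        Path output = new Path("output/kmeans");
        // Parameter list assumed from the 0.7-era API -- verify before use.
        KMeansDriver.run(conf, input, clustersIn, output,
            0.001,   // convergenceDelta: the threshold discussed above
            20,      // maxIterations
            true,    // runClustering: also write clusteredPoints (the -cl step)
            0.0,     // clusterClassificationThreshold
            false);  // runSequential: false => run as MapReduce
      }
    }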


Re: mahout clusterdump output

Posted by "Yuji NISHIDA@U-Tokyo" <ni...@gmail.com>.
Thank you for your kind explanation.
I added the -cl option when running kmeans, so it no longer seems to be a problem.

But I also want to make sure that my clusterDump result shows "VL-", not "CL-".
Do you think this output is correct?

Best regards.




-- 
nishidy@u-tokyo

Re: mahout clusterdump output

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I think the discrepancy between the number (n=) of vectors reported by 
the cluster and the number of points actually clustered by the -cl 
option is normal.

  * In the final iteration, points are assigned to (observed by)
    (classified as) each cluster based upon the distance measure and the
    cluster center computed from the previous iteration. The (n=) value
    records the number of points "observed by" the cluster in that
    iteration.
  * After the final iteration, a new cluster center is calculated for
    each cluster. This moves the center by some amount, less than the
    convergence threshold, but it moves.
  * During the subsequent classification (-cl) step, these new centers
    are used to classify the points for output. This will inevitably
    cause some points to be assigned to (observed by) (classified as) a
    different cluster and so the output clusteredPoints will reflect
    this final assignment.

In small, contrived examples, the clustering will likely be more stable 
between the final iteration and the output of clustered points.
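
To see why the counts can differ, here is a tiny self-contained toy (plain
Java, not Mahout code; the exaggerated center movement is for illustration
-- in a converged run the centers move by less than the threshold, but the
boundary effect is the same):

    import java.util.Arrays;

    public class NDiscrepancyDemo {
      // Index of the closest center to point p (1-D for simplicity).
      static int nearest(double p, double[] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
          if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) best = c;
        }
        return best;
      }

      public static void main(String[] args) {
        double[] points = {0.0, 1.0, 2.0, 3.9, 4.5, 5.0, 5.5, 6.0};
        double[] centers = {1.0, 7.0};  // centers entering the "final" iteration

        // Final iteration: each point is observed by its nearest cluster,
        // which is what the n= value records.
        int[] n = new int[centers.length];
        double[] sum = new double[centers.length];
        for (double p : points) {
          int c = nearest(p, centers);
          n[c]++;
          sum[c] += p;
        }

        // After the final iteration the centers are recomputed and move.
        double[] newCenters = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
          newCenters[c] = n[c] > 0 ? sum[c] / n[c] : centers[c];
        }

        // The -cl output step classifies against the *new* centers, so a
        // boundary point (here 3.9) can land in a different cluster.
        int[] clustered = new int[centers.length];
        for (double p : points) {
          clustered[nearest(p, newCenters)]++;
        }

        System.out.println("n= recorded in final iteration: " + Arrays.toString(n));         // [4, 4]
        System.out.println("points written by -cl step:     " + Arrays.toString(clustered)); // [3, 5]
      }
    }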





RE: mahout clusterdump output

Posted by "Whitmore, Mattie" <mw...@harris.com>.
Hi,

I too am having this problem.  I have a very small dimension space (3) and a lot of vectors (hundreds of millions).  Therefore I can't print everything to disk (I receive an OOM error).  However, I can easily print 30 sample points, and doing so showed results similar to yours (I "named" my vectors to match the number of vectors clusterDumper printed in the cluster):

VL-50{n=0 c=[...] r=[]}
        Weight : [props - optional]:  Point:
        1.0:    1 = [...]
        1.0:    2 = [...]
		...
        1.0:   10 = [...]

--> Note also that the radius is blank even though the points do have spread in all dimensions; this happened ONLY with converged clusters.
        
CL-51{n=4 c=[...] r=[...]}
        Weight : [props - optional]:  Point:
        1.0:    1 = [...]
        1.0:    2 = [...]
		...
        1.0:    6 = [...]

As far as I understand the algorithm, problems which arise due to dimensionality are convergence problems.  Basically, the distance between points gets "longer" as dimension increases (volume increases dramatically with dimension).
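
As a quick sanity check on that claim, a small self-contained experiment
(plain Java; points drawn uniformly from the unit cube, so the specific
numbers are just what this toy produces, roughly sqrt(d/6)):

    import java.util.Random;

    public class DimensionDistance {
      // Mean Euclidean distance between random point pairs in the unit cube.
      static double meanPairDistance(int dim, int samples, Random rnd) {
        double total = 0.0;
        for (int s = 0; s < samples; s++) {
          double sumSq = 0.0;
          for (int i = 0; i < dim; i++) {
            double d = rnd.nextDouble() - rnd.nextDouble(); // coordinate difference
            sumSq += d * d;
          }
          total += Math.sqrt(sumSq);
        }
        return total / samples;
      }

      public static void main(String[] args) {
        Random rnd = new Random(42);
        System.out.println("d=3:   " + meanPairDistance(3, 100000, rnd));   // ~0.66
        System.out.println("d=100: " + meanPairDistance(100, 100000, rnd)); // ~4.1
      }
    }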

This shouldn't affect clusterDumper, as clusterDumper simply reports on sequence files from a completed job.  This is why the discrepancy is not making a lot of sense to me.  Having more vectors within each cluster makes sense -- when I sum the printed n values, I get a number orders of magnitude smaller than the number of vectors I clustered.

I used Mahout v0.7, Hadoop 0.20.2-cdh3u3



Re: mahout clusterdump output

Posted by "Yuji NISHIDA@U-Tokyo" <ni...@gmail.com>.
Hi all

I still want to confirm that this is not a problem, especially regarding
the n value; I just hope it is not problematic...

I discussed this in my lab, and one of our members noted that the
dimensionality of the feature vectors and the number of vectors I used
were very different: I used 100-dimensional vectors, and 600,000 of them.

Do you think it may cause problems to use a small dimensionality together
with a large number of vectors, and do we need to ensure some relation
between them (especially in number)?
Or do you think 100 dimensions is too small?

I would very much appreciate it if someone could follow up on my question.

Regards.




-- 
nishidy@u-tokyo