Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/06/18 21:59:59 UTC

IndexOutOfBoundsException in RandomSeedGenerator.buildRandom()

Hi all,

Thanks once more for everyone's help so far; it has been extremely 
fruitful. I'm about 98% of the way through my first sprint, but 
unfortunately there is a single error on my second-to-last line of code.

Right after performing an eigen-decomposition using the 
DistributedLanczosSolver, I feed the outputs directly into the KMeans 
utility, RandomSeedGenerator, in order to create random cluster 
centroids for a given K. Unfortunately, during that buildRandom() method 
call, I hit an index out of bounds exception, and it seems to be an 
off-by-1 problem (for k=3, the arrays generated are only of length 2).

More detail to be found here: 
http://spectrallyclustered.wordpress.com/2010/06/18/sprint-1-so-very-close/

I think part of the problem is due to a lack of understanding of the 
LanczosSolver process. I do know that the eigenvectors are returned as 
rows in a matrix, in which case the data points I need to feed to KMeans 
are the columns. How does the desiredRank parameter fit in when it's 
returning a row matrix? The rule of thumb I'm using is that # of 
clusters = # of eigenvectors. Is there any way to enforce this heuristic 
explicitly?
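
To make the row/column relationship concrete, here is a minimal plain-Java sketch of turning k eigenvector rows of length n into n data points of dimension k. It uses raw double arrays rather than any Mahout matrix class, and the class and method names (EigenRowsToPoints, toPoints) are hypothetical, chosen only for illustration.

```java
class EigenRowsToPoints {
    /**
     * Transposes a k x n array of eigenvector rows into n points of
     * dimension k, so that each column of the input becomes one point.
     * Assumes every row has the same length.
     */
    static double[][] toPoints(double[][] eigenRows) {
        int k = eigenRows.length;
        int n = (k == 0) ? 0 : eigenRows[0].length;
        double[][] points = new double[n][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < n; j++) {
                points[j][i] = eigenRows[i][j];
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Two eigenvectors over four original data points.
        double[][] rows = { {1, 2, 3, 4}, {5, 6, 7, 8} };
        double[][] points = toPoints(rows);
        // prints "4 points of dimension 2"
        System.out.println(points.length + " points of dimension " + points[0].length);
    }
}
```

With this layout, the number of eigenvector rows fixes the dimension of each point, not the number of points, so a short row count shows up as a short vector length rather than as a missing data point.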

Any insights here would be greatly appreciated; I've posted a patch with 
my latest code on JIRA. Thanks so much!

Regards,
Shannon

Re: IndexOutOfBoundsException in RandomSeedGenerator.buildRandom()

Posted by Jake Mannix <ja...@gmail.com>.
Hi Shannon,

  For the SVD/eigen-decomposition, (desiredRank - 1) will be your number of
rows/eigenvectors (which may be the source of your off-by-1 issue). In
practice, though, you need to clean up spurious (repeated) eigenvectors
using the EigenVerificationJob. The output of that job will be some number
strictly less than desiredRank, and possibly more than one less. The idea is
that you should ask for "a handful" more than you really need (maybe 5-10%
more), and then throw away the leftovers you don't want.
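
As a rough illustration of that arithmetic, the sketch below computes a rank to request for a given number of clusters k: one extra to cover the (desiredRank - 1) behaviour, plus a percentage of padding. The class RankPadding, the method rankToRequest and the exact rounding are assumptions made for this example, not part of Mahout.

```java
class RankPadding {
    /**
     * Returns a desiredRank to request so that, after losing one row and
     * discarding some spurious eigenvectors, roughly k usable ones remain.
     * paddingPercent is the extra margin (e.g. 10 for 10%), rounded up.
     */
    static int rankToRequest(int k, int paddingPercent) {
        if (k <= 0 || paddingPercent < 0) {
            throw new IllegalArgumentException("k must be positive and padding non-negative");
        }
        int extra = (k * paddingPercent + 99) / 100; // integer ceiling of k * percent / 100
        return k + 1 + extra;                        // +1 for the (desiredRank - 1) rows returned
    }

    public static void main(String[] args) {
        // prints "5": 3 clusters, plus 1 for the off-by-one, plus 1 of padding
        System.out.println(rankToRequest(3, 10));
    }
}
```

The padding is only a margin; how many vectors actually survive the cleanup depends on the data, so the result still has to be checked against k afterwards.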

  The right way to deal with this, I think, is for EigenVerificationJob to
be folded into DistributedLanczosJob so that it does this for the user: take
desiredRank, multiply it by 1.1 or 1.2 or so, do the final cleaning, throw
away the excess, and return exactly desiredRank vectors, as you are
expecting.
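
A minimal plain-Java sketch of the "clean, then truncate" step might look like the following. It is only a stand-in: it treats a candidate as a repeat when its absolute cosine similarity to an already kept vector is close to 1, which is an assumption made for this example and not a description of what EigenVerificationJob actually checks. The names EigenTrim, keepDistinct and tolerance are hypothetical, and vectors are assumed to be non-zero and of equal length.

```java
import java.util.ArrayList;
import java.util.List;

class EigenTrim {
    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /**
     * Walks the candidates in order, skips any that are (anti-)parallel to a
     * vector already kept, and stops once exactly k vectors are collected.
     */
    static List<double[]> keepDistinct(List<double[]> candidates, int k, double tolerance) {
        List<double[]> kept = new ArrayList<>();
        for (double[] candidate : candidates) {
            if (kept.size() == k) {
                break; // excess candidates are thrown away
            }
            boolean repeat = false;
            for (double[] v : kept) {
                if (Math.abs(cosine(candidate, v)) > 1.0 - tolerance) {
                    repeat = true;
                    break;
                }
            }
            if (!repeat) {
                kept.add(candidate);
            }
        }
        if (kept.size() < k) {
            throw new IllegalStateException(
                "only " + kept.size() + " distinct vectors found; request a larger rank");
        }
        return kept;
    }
}
```

Failing loudly when fewer than k vectors survive is a design choice for the sketch: it surfaces the shortfall at the cleanup step instead of later, as an index error inside the clustering code.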

  -jake
