Posted to dev@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2010/06/24 19:35:35 UTC

Re: Selectively discarding EigenVerification results and clustering assignments

Hey Shannon,

  I don't think that the EigenVerificationJob *modifies* any SequenceFiles -
that's a big no-no in Hadoop-land (data is write-once).  The output path for
the cleaned eigenvectors is "${mapred.output.dir}/largestCleanEigens/" -
look in EigenVerificationJob.saveCleanEigens().  It will give you as many
cleaned eigenvectors as it can get out of the ones that you gave it (i.e.
every eigenvector which has error less than maxError and eigenvalue greater
than minEigenvalue will be kept).
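
If you want to read those cleaned eigenvectors back programmatically, a
sketch like the following should be close (assumptions: the output is a
single SequenceFile of (key, VectorWritable) pairs, and the class name and
argument handling below are purely illustrative, not part of Mahout):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  // Sketch: iterate over the cleaned eigenvectors written by saveCleanEigens().
  public class ReadCleanEigens {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path cleanEigens = new Path(args[0], "largestCleanEigens"); // args[0] = job output dir
      FileSystem fs = FileSystem.get(cleanEigens.toUri(), conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, cleanEigens, conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          Vector eigenvector = value.get();
          System.out.println(key + " -> " + eigenvector.size() + " dimensions");
        }
      } finally {
        reader.close();
      }
    }
  }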

  If you wanted to add a parameter to that job "maxEigensToKeep", which
would prune off the smallest eigenvectors of the remaining cleaned set and
keep only that many, it would be a nice addition.
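
Just to sketch what I mean (nothing below exists in the job yet; the
parameter, the EigenPair holder, and the method names are all made up for
illustration), the pruning itself is just a sort-and-truncate over the
survivors of the error/eigenvalue filter:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;
  import org.apache.mahout.math.Vector;

  // Illustration of a hypothetical "maxEigensToKeep" pruning step.
  public final class EigenPruning {

    // Hypothetical holder for a cleaned (eigenvalue, eigenvector) pair.
    public static final class EigenPair {
      final double eigenvalue;
      final Vector eigenvector;
      public EigenPair(double eigenvalue, Vector eigenvector) {
        this.eigenvalue = eigenvalue;
        this.eigenvector = eigenvector;
      }
    }

    // Keep only the maxEigensToKeep pairs with the largest eigenvalues.
    public static List<EigenPair> prune(List<EigenPair> cleaned, int maxEigensToKeep) {
      List<EigenPair> sorted = new ArrayList<EigenPair>(cleaned);
      Collections.sort(sorted, new Comparator<EigenPair>() {
        public int compare(EigenPair a, EigenPair b) {
          return Double.compare(b.eigenvalue, a.eigenvalue); // descending by eigenvalue
        }
      });
      return new ArrayList<EigenPair>(
          sorted.subList(0, Math.min(maxEigensToKeep, sorted.size())));
    }
  }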

  I'm not exactly sure what you're asking about the cluster dumping...

  -jake

On Thu, Jun 24, 2010 at 5:03 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Hi all,
>
> Hopefully these two questions will be my last, at least until my next
> sprint... :)
>
> I've run the EigenVerification task, and from what I can tell it modifies
> the SequenceFiles themselves that contain the results of the LanczosSolver.
> My first question is fairly straightforward: since I need to do as Jake
> suggested earlier - set my desiredRank for the LanczosSolver as 1.2-1.5
> times what I actually want, then discard the highest-order eigenvectors down
> to exactly desiredRank - how do I actually perform the discard of the extra
> rows in the SequenceFiles? I tried making a DistributedRowMatrix out of the
> results and hard-setting the number of rows, but all the rows written by the
> LanczosSolver showed up.
>
> Part of this spectral clustering is to use the components of the
> eigenvectors as proxies for the real data, so after I've performed k-means
> clustering, I need to be able to read the cluster assignments
> programmatically, and transfer those assignments back to the original data.
> I know of the clusterdump tool, but to be honest I'm having trouble
> interpreting its output, plus I'm unsure of how I would output the cluster
> assignments from my program. It would seem, for compatibility purposes, that
> the format of clusterdump would be ideal, but I'm not sure how to do this
> when I'm proxying the cluster assignments. Any thoughts on this would be
> wonderful.
>
> Thank you!
>
> Shannon
>

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Shannon Quinn <sq...@gatech.edu>.
Please disregard the first question, I found the error...a very
embarrassingly simple mistake :o)

But I am still curious about the second:

> Also, why is the computePairwiseInnerProducts() method called in the 
> verification job's run(), but the return value (a VectorIterable) 
> never used?

Shannon

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Shannon Quinn <sq...@gatech.edu>.
Ok, here's another interesting bit that I somehow only just now 
uncovered: my verification job isn't returning any eigenvectors. When I 
hand the ${mapred.output.dir}/largestCleanEigens path off to the KMeans
buildRandomSeed() method, it errors out with an 
IndexOutOfBoundsException after attempting to access index 0 (the actual 
sequence file supposedly containing the results has the header and 
nothing else in it).

I started with the parameters of minEigenValue = 0.01, maxError = 0.05, 
but relaxed these to 0.0 and 1, respectively, with no effect.
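
For reference, a quick way to confirm that the file really is empty is to
count its records directly (a sketch; the class name and argument handling
are just illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;

  // Sketch: count the records in a SequenceFile to see whether the cleaned
  // eigenvector output contains anything beyond the header.
  public class CountSeqFileRecords {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path(args[0]); // e.g. ${mapred.output.dir}/largestCleanEigens
      FileSystem fs = FileSystem.get(path.toUri(), conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      int count = 0;
      while (reader.next(key, value)) {
        count++;
      }
      reader.close();
      System.out.println(count + " records in " + path);
    }
  }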

Also, why is the computePairwiseInnerProducts() method called in the 
verification job's run(), but the return value (a VectorIterable) never 
used?

Thanks!

Shannon

On 6/24/2010 2:25 PM, Jake Mannix wrote:
> On Thu, Jun 24, 2010 at 6:21 PM, Ted Dunning<te...@gmail.com>  wrote:
>
>    
>> I think that the normal nomenclature is to assume that the eigen-vectors
>> are
>> column vectors (hence the V' in the singular decomposition) and thus most
>> references would refer to clustering *rows* of the eigenvector matrix
>> (which
>> has one row per column of the original matrix and one column per
>> eigenvalue).
>>
>>      
> Everything in Distributed-Mahout matrix land is a row.
> We have no columns here, sorry to break convention. :)
>
>    -jake
>
>    


Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Jun 24, 2010 at 6:21 PM, Ted Dunning <te...@gmail.com> wrote:

> I think that the normal nomenclature is to assume that the eigen-vectors
> are
> column vectors (hence the V' in the singular decomposition) and thus most
> references would refer to clustering *rows* of the eigenvector matrix
> (which
> has one row per column of the original matrix and one column per
> eigenvalue).
>

Everything in Distributed-Mahout matrix land is a row.
We have no columns here, sorry to break convention. :)

  -jake

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Shannon Quinn <sq...@gatech.edu>.
Hi Jake and Ted,

>> Let me be clear in understanding this: you take the matrix of eigenvectors,
>> which has desiredRank rows, of originalSize columns each, and take the
>> *columns* of this matrix (all originalSize of them, each of which has
>> desiredRank entries) and cluster them with KMeans, right?
>>      
> I think that the normal nomenclature is to assume that the eigen-vectors are
> column vectors (hence the V' in the singular decomposition) and thus most
> references would refer to clustering *rows* of the eigenvector matrix (which
> has one row per column of the original matrix and one column per
> eigenvalue).
>    

This is precisely it; I will have to load the cleaned eigenvectors into 
a DistributedRowMatrix and take its transpose in order to get the 
arrangement I'm looking for (unless KMeans can interpret the row vectors 
on its own?).
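
Roughly what I have in mind, as a sketch only (the DistributedRowMatrix
constructor and configuration calls have changed between Mahout versions,
so treat the exact signatures below as assumptions and check the class in
your checkout):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  // Sketch: the cleaned eigenvectors form a (desiredRank x originalSize) row
  // matrix; transposing yields an (originalSize x desiredRank) matrix whose
  // rows (one per original data point) are what KMeans should cluster.
  public class TransposeCleanEigens {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path cleanEigens = new Path(args[0], "largestCleanEigens");
      Path tmp = new Path(args[0], "drm-tmp");
      int desiredRank = Integer.parseInt(args[1]);  // rows of the eigenvector matrix
      int originalSize = Integer.parseInt(args[2]); // columns (= number of data points)
      // Older releases take String paths and configure(JobConf); newer ones
      // take Paths and setConf(Configuration).
      DistributedRowMatrix eigens =
          new DistributedRowMatrix(cleanEigens, tmp, desiredRank, originalSize);
      eigens.configure(new JobConf(conf));
      DistributedRowMatrix eigensT = eigens.transpose();
      System.out.println("transposed to " + eigensT.numRows() + " x " + eigensT.numCols());
    }
  }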

Still working on this patch...

Shannon

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Ted Dunning <te...@gmail.com>.
I think that the normal nomenclature is to assume that the eigen-vectors are
column vectors (hence the V' in the singular decomposition) and thus most
references would refer to clustering *rows* of the eigenvector matrix (which
has one row per column of the original matrix and one column per
eigenvalue).
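
In symbols, for a symmetric n x n affinity (or Laplacian) matrix A and k
retained eigenvalues, the convention is:

  A v_j = \lambda_j v_j \quad (j = 1, \ldots, k), \qquad
  V = [\, v_1 \; v_2 \; \cdots \; v_k \,] \in \mathbb{R}^{n \times k}

Row i of V (a k-vector) is the embedding of original data point i, and those
n rows are what k-means clusters; a decomposer that emits one eigenvector per
*row* is effectively handing you V'.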

It is sometimes really convenient to actually store the transpose of the
eigenvectors.

Jake, is that what you are saying the Mahout decomposer does?

On Thu, Jun 24, 2010 at 11:06 AM, Jake Mannix <ja...@gmail.com> wrote:

> Let me be clear in understanding this: you take the matrix of eigenvectors,
> which has desiredRank rows, of originalSize columns each, and take the
> *columns* of this matrix (all originalSize of them, each of which has
> desiredRank entries) and cluster them with KMeans, right?
>

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Jun 24, 2010 at 5:49 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
>
> I figured that was the case...I just found it strange that, if
> saveCleanEigens() was saving them somewhere else hard-coded into the
> method, that output path was neither specified by the caller nor returned
> by the callee, so I just assumed the next logical (what seemed logical to
> me, anyway :P) pattern: that the original eigenvectors were overwritten.


Sensible, and you raise a valid point: having the output path be a parameter
which is transparently used to, well, put the output, is a good change if
you want to incorporate that into a patch.  Using a hardcoded subdirectory
isn't necessary here; it's just following the typical pattern of having
$output/dictionary, $output/someModelSubDir, $output/someOtherData, in many
of our jobs.  In this case, there's really only one subdirectory.


> Sorry, it was definitely unclear. Since I'm running Kmeans clustering on
> the matrix of eigenvectors as a proxy for running Kmeans on the actual data
> (where each component of the eigenvectors represents one of the original
> data points),


Let me be clear in understanding this: you take the matrix of eigenvectors,
which has desiredRank rows, of originalSize columns each, and take the
*columns* of this matrix (all originalSize of them, each of which has
desiredRank entries) and cluster them with KMeans, right?


> I need to, in effect, "transfer" the clustering assignments that Kmeans
> gives on the eigenvectors back to the original data. And then output those
> assignments, ideally in exactly the same format as Kmeans, or any of the
> other clustering algorithms. I looked into the Kmeans unit tests and feel
> like I can easily read off the clustering assignments and correlate them to
> the original data, but then I'm not sure how to output these correlations,
> since the clustering was done on the eigenvector components.
>

Well, the nice thing at this point is that the output of KMeans gives
assignments keyed on the original keys of the input matrix (I think!), as a
SequenceFile<IntWritable,WeightedVectorWritable>, and this basically
*should* already be correlated with your original data, directly.
You don't really want to just be using clusterdump, that's just for seeing
stuff on the command line output...
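
A rough sketch of reading that output back (the directory name and whether
the IntWritable key is the cluster id or the original row id both vary by
version, so treat them as assumptions to verify against your KMeansDriver):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.clustering.WeightedVectorWritable;
  import org.apache.mahout.math.Vector;

  // Sketch: walk the KMeans point-assignment output and print one line per
  // assignment record; this is where you'd map each point back to the
  // original datum (e.g. by its position in the eigenvector matrix).
  public class ReadKMeansAssignments {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path pointsDir = new Path(args[0]); // e.g. <kmeans output>/points
      FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
      for (FileStatus status : fs.listStatus(pointsDir)) {
        if (!status.getPath().getName().startsWith("part-")) {
          continue; // skip _logs and other non-data files
        }
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
          Vector point = value.getVector();
          System.out.println(key.get() + "\t" + point.size());
        }
        reader.close();
      }
    }
  }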

Can someone more familiar with the KMeansDriver chime in here, maybe?

  -jake

Re: Selectively discarding EigenVerification results and clustering assignments

Posted by Shannon Quinn <sq...@gatech.edu>.
Hi Jake,

>    I don't think that the EigenVerificationJob *modifies* any SequenceFiles -
> that's a big no-no in Hadoop-land (data is write-once).  The output path for
> the cleaned eigenvectors is "${mapred.output.dir}/largestCleanEigens/" -
> look in EigenVerificationJob.saveCleanEigens().  It will give you as many
> cleaned eigenvectors as it can get out of the ones that you gave it (i.e.
> every eigenvector which has error less than maxError and eigenvalue greater
> than minEigenvalue will be kept).
>    

I figured that was the case...I just found it strange that, if
saveCleanEigens() was saving them somewhere else hard-coded into the
method, that output path was neither specified by the caller nor returned
by the callee, so I just assumed the next logical (what seemed logical to
me, anyway :P) pattern: that the original eigenvectors were overwritten.

>    If you wanted to add a parameter to that job "maxEigensToKeep", which
> would prune off the smallest eigenvectors of the remaining cleaned set and
> keep only that many, it would be a nice addition.
>    

Ahhhh. Yes. This had crossed my mind, but I was just curious if there
was anything I could do short of modifying the EigenVerifier itself to
prune out unwanted vectors. Will do.

>    I'm not exactly sure what you're asking about the cluster dumping...
>    

Sorry, it was definitely unclear. Since I'm running Kmeans clustering on 
the matrix of eigenvectors as a proxy for running Kmeans on the actual 
data (where each component of the eigenvectors represents one of the 
original data points), I need to, in effect, "transfer" the clustering 
assignments that Kmeans gives on the eigenvectors back to the original 
data. And then output those assignments, ideally in exactly the same 
format as Kmeans, or any of the other clustering algorithms. I looked 
into the Kmeans unit tests and feel like I can easily read off the 
clustering assignments and correlate them to the original data, but then 
I'm not sure how to output these correlations, since the clustering was 
done on the eigenvector components.

Please let me know if that's still not clear. Thanks again!

Shannon