Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/07/13 01:38:56 UTC

Low-rank transition matrix approximation

Hi all,

I have a question that's a bit more rooted in theory. In my clustering
algorithm, I depend on the Markovian properties of the graph representation
of the data, using the eigenvectors/eigenvalues to determine where the
cluster boundaries are. The Markov transition matrix is dense, as it
represents the full distribution of a random walk through the entire data
set.
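
For concreteness, the construction I mean is the usual one: given a
symmetric affinity matrix W over the graph's nodes, the transition matrix
is P = D^-1 * W, where D is the diagonal matrix of row sums. A toy
plain-Java sketch of that step (dense arrays and all names are mine,
purely for illustration):

// Row-normalize an affinity matrix W into a Markov transition
// matrix P, with P[i][j] = W[i][j] / sum_k W[i][k], so that each
// row sums to 1.
static double[][] toTransitionMatrix(double[][] w) {
  int n = w.length;
  double[][] p = new double[n][n];
  for (int i = 0; i < n; i++) {
    double rowSum = 0.0;
    for (int j = 0; j < n; j++) {
      rowSum += w[i][j];
    }
    for (int j = 0; j < n; j++) {
      p[i][j] = rowSum > 0.0 ? w[i][j] / rowSum : 0.0;
    }
  }
  return p;
}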

In the thesis behind this algorithm (link:
http://dl.dropbox.com/u/1377610/eigencuts_thesis.pdf ), the author dedicates
an entire chapter (Chapter 8) to using successively lower-rank sparse
approximations of the full Markov transition matrix to speed up the
eigendecomposition, which is a bottleneck because the decomposition is
performed at every iterative step of the clustering process.

My question is whether, on a framework like Mahout, this optimization is
necessary. I certainly don't doubt there would be a performance
improvement, but would it be noticeable on a distributed framework? If
the Markov transition matrix is already a DistributedRowMatrix, and I'm
already using tools like the DistributedLanczosSolver, is an extra
map/reduce pass for creating a lower-rank Markov matrix really necessary?

I'm very eager to hear other folks' thoughts on this. Thanks!

Regards,
Shannon

Re: Low-rank transition matrix approximation

Posted by Shannon Quinn <sq...@gatech.edu>.
> is it really the case that you are always finding eigenvectors of a fully
> *dense* matrix?  The DRM and Lanczos impls in Mahout may really not be
> terribly suited to this use case, if so...

The transition matrix itself is already pretty sparse. Particularly with
images, it's very rare that an arbitrary node has more than 8 connected
neighbors (corresponding to the neighboring pixels). While each row of the
transition matrix sums to 1 (the probability of moving from the node denoted
by the current row to any other node in the graph), most of these
transitions will be 0 since there won't be an edge.
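
To make that sparsity concrete: for a width x height image, with each
pixel a node with id y * width + x, the only columns of a row that can be
nonzero are the pixel's 8-neighborhood. A toy sketch (plain Java, names
mine) of enumerating those neighbors:

// Collect the (at most 8) neighboring node ids of pixel (x, y) in a
// width-by-height image, using node id = y * width + x. These are
// the only possible nonzero columns in that pixel's row of the
// transition matrix.
static int[] neighborIds(int x, int y, int width, int height) {
  int[] buf = new int[8];
  int count = 0;
  for (int dy = -1; dy <= 1; dy++) {
    for (int dx = -1; dx <= 1; dx++) {
      if (dx == 0 && dy == 0) {
        continue; // skip the pixel itself
      }
      int nx = x + dx;
      int ny = y + dy;
      if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
        buf[count++] = ny * width + nx;
      }
    }
  }
  return java.util.Arrays.copyOf(buf, count);
}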

The algorithm uses a slightly adjusted version of the transition matrix, one
which has a more stable eigendecomposition. It's not quite as sparse as the
one I described, but it's pretty close.
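
To illustrate the kind of adjustment I mean (this is the standard trick
from spectral methods; I'm sketching it from memory rather than quoting
the thesis's exact construction): instead of P = D^-1 * W, work with
M = D^-1/2 * W * D^-1/2. M is similar to P, so it has the same
eigenvalues, but it's symmetric, which makes the eigendecomposition much
better behaved numerically. Toy Java, dense arrays and names mine:

// Symmetric normalization M = D^-1/2 * W * D^-1/2. Since
// M = D^1/2 * P * D^-1/2 when P = D^-1 * W, M shares P's
// eigenvalues while being symmetric.
static double[][] symmetricNormalize(double[][] w) {
  int n = w.length;
  double[] dInvSqrt = new double[n];
  for (int i = 0; i < n; i++) {
    double rowSum = 0.0;
    for (int j = 0; j < n; j++) {
      rowSum += w[i][j];
    }
    dInvSqrt[i] = rowSum > 0.0 ? 1.0 / Math.sqrt(rowSum) : 0.0;
  }
  double[][] m = new double[n][n];
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      m[i][j] = dInvSqrt[i] * w[i][j] * dInvSqrt[j];
    }
  }
  return m;
}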

The lower-rank approximation does make the matrix sparser, so if you think
this is an issue, I could certainly look into it. I just hadn't considered
it until now.
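
Concretely, the kind of sparsification Jake asks about could look
something like the following: zero out entries below a threshold, then
renormalize each row so it still sums to 1. A toy sketch (plain Java; the
epsilon knob and the names are mine):

// Sparsify a transition matrix by zeroing entries below epsilon,
// then renormalizing each row so the result is still row-stochastic
// (each row sums to 1).
static double[][] sparsify(double[][] p, double epsilon) {
  int n = p.length;
  double[][] s = new double[n][n];
  for (int i = 0; i < n; i++) {
    double rowSum = 0.0;
    for (int j = 0; j < n; j++) {
      s[i][j] = p[i][j] >= epsilon ? p[i][j] : 0.0;
      rowSum += s[i][j];
    }
    if (rowSum > 0.0) {
      for (int j = 0; j < n; j++) {
        s[i][j] /= rowSum;
      }
    }
  }
  return s;
}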

Shannon

Re: Low-rank transition matrix approximation

Posted by Ted Dunning <te...@gmail.com>.
Shannon,

To add to Jake's voice on this,

dense + large == not scalable, if only because of the n^2 entries.
Scalable implies O(n) time and O(1) memory throughout, on every machine
involved in the computation. Even O(n log n) doesn't work, because it
usually implies O(n) memory.

For non-parallel algorithms, this means on-line algorithms are pretty much
all that works. For parallel programs, we usually assume something like at
most O(log n) machines in the cluster.
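
To put rough numbers on that: at n = 10^6 nodes, a dense transition matrix
has 10^12 entries (about 8 TB at 8 bytes per double), while the 8-neighbor
pixel graph Shannon describes has only about 8 x 10^6 nonzeros, on the
order of 100 MB even with indices stored alongside the values.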

On Mon, Jul 19, 2010 at 11:25 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Tue, Jul 13, 2010 at 1:38 AM, Shannon Quinn <sq...@gatech.edu> wrote:
>
> > Hi all,
> >
> > I have a question that's a bit more rooted in theory. In my clustering
> > algorithm, I depend on the Markovian properties of the graph
> > representation of the data, using the eigenvectors/eigenvalues to
> > determine where the cluster boundaries are. The Markov transition
> > matrix is dense, as it represents the full distribution of a random
> > walk through the entire data set.
> >
>
> Hey Shannon,
>
>  I really wish I'd had the time to read more on the full theory of this
> algorithm - is it really the case that you are always finding
> eigenvectors of a fully *dense* matrix?  The DRM and Lanczos impls in
> Mahout may really not be terribly suited to this use case, if so...
>
>  Can the algorithm work if you "sparsify" your transition matrix, so
> that it only keeps the relatively significant entries, setting the
> smaller values to zero?
>
>  -jake
>

Re: Low-rank transition matrix approximation

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jul 13, 2010 at 1:38 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> Hi all,
>
> I have a question that's a bit more rooted in theory. In my clustering
> algorithm, I depend on the Markovian properties of the graph representation
> of the data, using the eigenvectors/eigenvalues to determine where the
> cluster boundaries are. The Markov transition matrix is dense, as it
> represents the full distribution of a random walk through the entire data
> set.
>

Hey Shannon,

  I really wish I'd had the time to read more on the full theory of this
algorithm - is it really the case that you are always finding eigenvectors
of a fully *dense* matrix?  The DRM and Lanczos impls in Mahout may really
not be terribly suited to this use case, if so...

  Can the algorithm work if you "sparsify" your transition matrix, so that
it only keeps the relatively significant entries, setting the smaller
values to zero?

  -jake