Posted to user@mahout.apache.org by Steven Buss <st...@gmail.com> on 2010/02/24 22:53:35 UTC

Symmetric eigendecomposition for kernel PCA

I was chatting with Jake Mannix on twitter regarding mahout 180 and if
that patch is suitable for sparse symmetric positive-definite
matrices, and he suggested we continue the conversation on the mailing
list, so:

My research partner and I have a dataset that consists of 400,000
users and 1.6 million articles, with about 22 million nonzeros. We are
trying to use this data to make recommendations to users. We have
tried using the SVD and PLSI, both with unsatisfactory results, and
are now attempting kPCA.

We have a 400,000 by 400,000 sparse symmetric positive-definite
matrix, H, that we need the top couple hundred eigenvectors/values
for. Jake has told me that I can use mahout 180 unchanged, but it will
be doing redundant work and the output eigenvalues are the squares of
the ones we actually want. This sounds like a good approach, but it
would be great if mahout had an optimized eigendecomposition for
symmetric matrices. Jake suggested I submit a JIRA ticket regarding
this, which I plan to do.

H is the pairwise distance in feature space (calculated using a kernel
function) between each pair of users (or some subset of users). After
I mentioned this to Jake, he asked me "why aren't you just doing it
all in one go? Kernelize on the rows, and do SVD on that? Why do the
M*M^t intermediate step?" Unfortunately, I'm not sure what you're
asking, Jake, can you clarify?

Steven Buss
steven.buss@gmail.com
http://www.stevenbuss.com/

Re: Symmetric eigendecomposition for kernel PCA

Posted by Ted Dunning <te...@gmail.com>.
To clarify, my expectation is that this monster pass will let us do a fairly
general kernel AND the SVD in one pass.

On Fri, Feb 26, 2010 at 12:08 AM, Jake Mannix <ja...@gmail.com> wrote:

> stochastic decomposition work will certainly be the way
> to go, as it allows you to do it in *one* monster pass over the data.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Symmetric eigendecomposition for kernel PCA

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Feb 25, 2010 at 7:48 PM, Steven Buss <st...@gmail.com> wrote:
>
> I still don't see this. Hopefully I can explain my confusion. It seems
> to me that the complexity you wrote uses a kernel that does not
> require pair-wise computations. It doesn't look like this is actually
> operating on the Gram matrix, H.
>

This is correct, it's not using the full Gram matrix at all, which won't
yield the same results as what you're doing for a general kernel K(u,v).

But it *does* work for kernels of the form K(u,v) = <k(u), k(v)>, where
the inner product is the euclidean inner product, and k() is a (possibly
complex, nonlinear) function from R^(X.numCols) to R^(dim_Kernel).

Let's go over in detail a little how this works and see if I'm smoking
crack:


> (call the data matrix X)
> I think a more accurate complexity would be O(k * numUsers *
> (kernelCost * numUsers)), which is the same as precomputing H. For
> every row, i, of X that the Lanczos algorithm comes to, we need to
> find k(X_i, X_{1...N}).
>

No, this is not what Lanczos does.  Let's drop the kernel for a
bit, and stick with the euclidean one.  Using Lanczos to get the SVD
of X can be done in several ways, as the SVD is really looking for
either the left singular vectors or the right ones, because once you
have one set, you can reconstruct the other in a single pass over
your original data.  So we can either look for eigenvectors of
X*X^{t} (which are N-dimensional), or eigenvectors of X^{t}*X
(which are M-dimensional).  The most straightforward way to look
for eigenvectors of X*X^{t} is to first compute the full X*X^{t}
matrix, and then run (symmetric) Lanczos on the result to pull
out N-dimensional eigenvectors.

If instead we look for the M-dimensional eigenvectors of X^{t}*X,
by noting the following:

for each i = 1->M

 [ (X^{t}*X)*v ]_i = sum_{j=1->N, k=1->M} (X_ji X_jk v_k)

Ostensibly, this looks like an O(N * M^2) computation, and
would be, if X were dense.

But since it's sparse, reordering this sum can make clear how few
of these dot products are really necessary:

sum_k (X_jk v_k) is just <X_j, v> and if X_j only has a few
nonzero entries, this computation is O(numEntries(X_j)), not
O(dim(X_j)) = M.

So to compute the vector

   v' = [(X^{t}*X)*v] = vectorSum_i (X_i * <X_i,v>)

you are just taking repeated sparse dot products of X_i with
your input (possibly dense: eigen!) vector v, and then using
that number to scale that (sparse!) row up by that factor, and
then accumulating that vector onto your final (probably dense)
output vector v'.  If the average density of each row X_i of X
is "d", then this is O(d * N) FLOPs (4*d*N to be exact).

Lanczos then does this process k times:

  O(d*N*k) operations

To compute the full gram matrix, you can also play some
tricks to find only the pairs of rows of X which intersect
(and thus have nontrivial dot product), but unless this process
is very clever (maybe the MapReduce passes you describe
are this clever, I haven't dug into them in any detail yet),
you may have a bunch of pairs of rows, X_i and X_j,
which are known to overlap, and to compute their inner
product, you still need to iterate over all d entries of the
shorter of X_i and X_j, to find out where they actually
overlap.

So let's say we know which pairs of X_i and X_j
we need to compute dot products on.  If the distribution
of sparsity were fairly uniform (not the case, but it's only
going to be worse if it's non-uniform), then the
probability that X_i and X_j don't overlap at all is

  (1-(d/M))^d =approx (1 - d^2/M) (sparse: d/M is small),

so the probability that a given pair *does* overlap is roughly d^2/M.

Then there are about N^2 * d^2/M nonzero entries in the gram
matrix (it's still sparse, but in comparison to N*d
nonzero entries in X, it's denser by a multiplicative
factor of d*(N/M)), and to compute each one it can
take O(d) operations (not flops, but still cpu cycles
of walking a sparse vector's iterator), and Lanczos
takes O(k * nonzeroEntries(A)) operations to get
k-eigenvectors of a matrix A, so all in all, in the
sparse case, it looks like compute-Gram-then-decompose
will take something like

O(k*N^2 * d^2 * d / M) = O(k * N * d^3 * (N/M))

operations, while Lanczos in one pass is

O(k * N * d).

Now if we get clever enough to turn that d^3 into
more like d^2, and we observe that at least in your case,
M is a bit more than N, and d is really small, then
this is probably a wash.  But in the usual case, where
N is big, and d is not tiny (maybe 10's to 100's), then
the factor of (N/M) * d is a multiplicative factor in cpu
cycles (as well as space, which is often neglected
unless we really are talking about a factor of 1000
bigger, in which case your Hadoop admin may want
to have a word with you).
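
To put rough numbers on that, taking the detailed dataset stats quoted
elsewhere in this thread at face value (about 470K users, 1.6M articles,
13.3M nonzeros) along with the uniform-sparsity assumption above:
d ~= 13.3M/470K ~= 28 and N/M ~= 0.29, so the two estimates come out to
roughly

  Gram-then-decompose:  k * N * d^3 * (N/M) ~= k * N * 6,400
  one-pass Lanczos:     k * N * d           ~= k * N * 28

a ratio of d^2 * (N/M) ~= 230 for the d^3 version, or d * (N/M) ~= 8 if the
d^3 really can be brought down to d^2 (and under the same assumption the
gram matrix comes out roughly that factor of 8 denser than X).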

The intuitive reason why folding the kernel into
X itself and computing the SVD wins over computing
the Gram matrix and doing an eigendecomposition of that
is that, essentially, eigendecomposition of a matrix A
is taking A, A^2, A^3, ... and using these in some
algebraic fashion to pull out eigenvectors.  The SVD
is getting the same vectors, but it's basically doing it
over X itself, the "square root" of A, instead of over
A = X*X^{t}.  Eigendecomposition is iterating over a
denser matrix than SVD is.

All of this was using the euclidean kernel, and does not
generalize to *all kernels*.  As long as you can represent
your kernel as just mapping the rows of X to some space
where the inner product can be computed as a euclidean
dot product, the entire argument goes over as is, where
M gets replaced with dim(KernelSpace), and d gets
replaced with numNonZeroEntries( k(X_i) ).
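
Concretely, the only change to the row-by-row product sketched above is that
each row gets mapped through k() before the dot product and the
accumulation; something like this (again just a sketch, with a hypothetical
RowKernel interface and the toy SparseRow from before, not anything in
Mahout today):

// Hypothetical interface: maps one sparse row of X into kernel space, where
// K(u,v) = <k(u), k(v)> is an ordinary dot product.
interface RowKernel {
  SparseRow map(SparseRow row);   // k(X_i); ideally still sparse
}

// Same loop as before, but over the kernelized rows; v and the output now
// live in R^(dim_Kernel) rather than R^M.
static double[] timesSquaredKernelized(Iterable<SparseRow> rowsOfX,
                                       double[] v, RowKernel kernel) {
  double[] vPrime = new double[v.length];
  for (SparseRow row : rowsOfX) {
    SparseRow mapped = kernel.map(row);            // recomputed on each pass
    double dot = 0.0;
    for (int n = 0; n < mapped.indices.length; n++) {
      dot += mapped.values[n] * v[mapped.indices[n]];
    }
    for (int n = 0; n < mapped.indices.length; n++) {
      vPrime[mapped.indices[n]] += dot * mapped.values[n];
    }
  }
  return vPrime;
}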

> Let's say that I am completely misunderstanding what you're saying and
> that your suggestion works. We would then still have the problem of
> centering H. There may be tricks to do this row-wise, but I am pretty
> sure that the entire matrix has to be computed first, meaning that
> your suggestion wouldn't work out after all.
>

Centering may be problematic, depending on what you mean by it.
You mean you're centering the rows of the Gram matrix?  If I knew how
and why you were centering, then maybe there'd be a way to salvage
the above technique, but I don't know.
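
(For reference, if the centering in question is the usual kernel-PCA
centering in feature space, it is typically written on the Gram matrix as

  H_centered = H - 1_N*H - H*1_N + 1_N*H*1_N

where 1_N is the N x N matrix with every entry equal to 1/N -- equivalently,
subtract each row mean and each column mean of H and add back the grand
mean.  Those means can be accumulated by streaming over the kernel values
rather than by storing all of H at once, but they do require every pairwise
value to be computed at least once.)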

Ok, I'm not sure if any of this made any sense, but either way, code
will soon be available in Mahout to allow for comparisons between
all these various techniques, and it may turn out that I/O costs
are so dominant that none of this matters except how much you can
minimize the number of passes over your data, in which case going
with the stochastic decomposition work will certainly be the way
to go, as it allows you to do it in *one* monster pass over the data.

  -jake

Re: Symmetric eigendecomposition for kernel PCA

Posted by Steven Buss <st...@gmail.com>.
> If instead you computed kernelized Lanczos all in one fell
> swoop, you'll be making k passes over the data, and computing
> the kernel each time, yes, but the scaling is just that of Lanczos:
>
> O(k * numUsers * kernelCost(userSparsity) )
>
> (as you mention, I forgot the kernel cost here, yes)
>
> This is tremendously better than computing the full kernel matrix,
> by a factor of numUsers / k = 5000 or so.  Unless I'm missing
> a factor somewhere.

I still don't see this. Hopefully I can explain my confusion. It seems
to me that the complexity you wrote uses a kernel that does not
require pair-wise computations. It doesn't look like this is actually
operating on the Gram matrix, H.

(call the data matrix X)
I think a more accurate complexity would be O(k * numUsers *
(kernelCost * numUsers)), which is the same as precomputing H. For
every row, i, of X that the Lanczos algorithm comes to, we need to
find k(X_i, X_{1...N}).

> Is the former, but the difference is between computing the k(r_i, r_j)
> for all i and j, and computing k(r_i, V) for all i, for a set of only K =
> 100's of dense eigenvectors (well, Lanczos vectors, used to produce
> eigenvectors).  Way way way less computation of the kernel.

I'm not sure what V is here. If V is an eigen (or Lanczos) vector,
isn't that of different dimensionality than r_i? I'm not sure how this
would work if that was the case, and I can't think of what else you
would mean by V.

Let's say that I am completely misunderstanding what you're saying and
that your suggestion works. We would then still have the problem of
centering H. There may be tricks to do this row-wise, but I am pretty
sure that the entire matrix has to be computed first, meaning that
your suggestion wouldn't work out after all.

> And you know, if you've got some clever MapReduce code to compute
> the full sparse kernelized H computation,  we could certainly use it here
> in Mahout!

I don't know if it's clever, but this is what we're currently doing:

Given an input file with (r,c) pairs, meaning there is a 1 at (r,c) in
the data matrix:
Round 1:
Identity Map
Expanding Reducer: (r,c) -> (c_1, (r, [c_{1..m}])), (c_2, (r, [c_{1..m}])),
ie:
1 2
1 3
2 1
2 4
4 1
4 2
would become
2    (1, [2, 3])
3    (1, [2, 3])
1    (2, [1, 4])
4    (2, [1, 4])
1    (4, [1, 2])
2    (4, [1, 2])

This is done to find the rows that have some overlap.
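
For concreteness, an expanding reducer along these lines might look roughly
like the following (a sketch against the org.apache.hadoop.mapreduce API
with plain Text keys/values; the class name and record format are
illustrative only):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Round 1 sketch: key = row id r, values = the column ids c of that row's
// nonzeros.  Emit (c, "r [c_1 ... c_m]") once per column, so rows that share
// a column meet under the same key in Round 2.
public class ExpandingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text row, Iterable<Text> cols, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder colList = new StringBuilder();
    for (Text c : cols) {
      if (colList.length() > 0) {
        colList.append(' ');
      }
      colList.append(c.toString());
    }
    String rowRecord = row.toString() + " [" + colList + "]";
    for (String c : colList.toString().split(" ")) {
      ctx.write(new Text(c), new Text(rowRecord));
    }
  }
}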

Round 2:
Identity Map
TupleMaker Reducer: (c_m, (r, [c_{1..m}])) -> ((r_1, r_2),
([c_1_{1..m}], [c_2_{1..m}]))
ie:
1    2 [1 4]
1    4 [1 2]
2    1 [2 3]
2    4 [1 2]
3    1 [2 3]
4    2 [1 4]
becomes:
[2 4]    [1 4] [1 2]
[1 4]    [2 3] [1 2]

now we have the pairs of rows on which the kernel function needs to be evaluated

Round 3:
Identity Map
Remove Duplicates Reducer

Obviously, this removes any duplicates that might appear (none in this
example, but it does happen)

Round 4:
Kernel Map: ((r_1, r_2), ([c_1_{1..m}], [c_2_{1..m}])) -> ([r_1, r_2],
k(r_1, r_2))
ie, say we're using the dot product as the kernel function (since the
matrix is binary, the dot product of two rows is just the number of
columns they share), so:
[2 4]    [1 4] [1 2]
[1 4]    [2 3] [1 2]
becomes:
[2 4]    1
[1 4]    1

We are doing so many rounds because we want a general approach to
applying kernel functions, and don't want to write a special case for
each kernel function we try. So this approach should work for all
pairwise kernels (like dot product, jaccard, cosine, etc)
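
For the binary data here, the pairwise kernels mentioned all reduce to set
operations on the two column-index lists; a sketch (assuming each row's
indices arrive sorted):

// Sketch: pairwise kernels over two 0/1 rows given as sorted column-index
// arrays.  For binary rows, dot product = |intersection| and
// Jaccard (Tanimoto) = |intersection| / |union|.
static int dotKernel(int[] a, int[] b) {
  int i = 0, j = 0, common = 0;
  while (i < a.length && j < b.length) {
    if (a[i] == b[j])     { common++; i++; j++; }
    else if (a[i] < b[j]) { i++; }
    else                  { j++; }
  }
  return common;
}

static double jaccardKernel(int[] a, int[] b) {
  int common = dotKernel(a, b);
  return common / (double) (a.length + b.length - common);
}

On the Round 4 example above, dotKernel([1, 4], [1, 2]) = 1 and
dotKernel([2, 3], [1, 2]) = 1.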


And now to Ted:
> I would be tempted to plot the rank vs number of articles
> and look for the point where users seem to be above the Zipf curve.  I would
> just eliminate those users from the data and see what happens.

Oh, that's an interesting idea! It might be some time before we get to
this, but we will definitely try this!

Steven Buss
steven.buss@gmail.com
http://www.stevenbuss.com/

Re: Symmetric eigendecomposition for kernel PCA

Posted by Ted Dunning <te...@gmail.com>.
You did.  I saw that just after I posted.

On Wed, Feb 24, 2010 at 10:34 PM, Jake Mannix <ja...@gmail.com> wrote:

> Dude, didn't I just say that?
>
>

Re: Symmetric eigendecomposition for kernel PCA

Posted by Jake Mannix <ja...@gmail.com>.
Dude, didn't I just say that?

:P

On Wed, Feb 24, 2010 at 10:31 PM, Ted Dunning <te...@gmail.com> wrote:

> Here is the best reference I have on stochastic decompositions:
> http://arxiv.org/abs/0909.4061v1
>
> On Wed, Feb 24, 2010 at 8:12 PM, Steven Buss <st...@gmail.com>
> wrote:
>
> > At the risk of sounding woefully ignorant, I don't know what this
> > means. I will read about it and hopefully have something meaningful to
> > say tomorrow :)
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Symmetric eigendecomposition for kernel PCA

Posted by Ted Dunning <te...@gmail.com>.
Here is the best reference I have on stochastic decompositions:
http://arxiv.org/abs/0909.4061v1

On Wed, Feb 24, 2010 at 8:12 PM, Steven Buss <st...@gmail.com> wrote:

> At the risk of sounding woefully ignorant, I don't know what this
> means. I will read about it and hopefully have something meaningful to
> say tomorrow :)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Symmetric eigendecomposition for kernel PCA

Posted by Ted Dunning <te...@gmail.com>.
Moreover, H is bigger than the original data due to partial fill-in, and the
computation is almost always going to be I/O bound anyway.  Thus, FLOPs will
be as Jake says, but wall clock will probably scale as the ratio of the
sizes of H and the original data.

On Wed, Feb 24, 2010 at 10:17 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> If instead you computed kernelized Lanczos all in one fell
> swoop, you'll be making k passes over the data, and computing
> the kernel each time, yes, but the scaling is just that of Lanczos:
>
> O(k * numUsers * kernelCost(userSparsity) )
>
> (as you mention, I forgot the kernel cost here, yes)
>
> This is tremendously better than computing the full kernel matrix,
> by a factor of numUsers / k = 5000 or so.  Unless I'm missing
> a factor somewhere.




-- 
Ted Dunning, CTO
DeepDyve

Re: Symmetric eigendecomposition for kernel PCA

Posted by Ted Dunning <te...@gmail.com>.
This is not only super sparse, but it is super skewed.  Having 71% of the
votes from 0.05% of the users is (a) suspicious relative to possible
spamming of the data set and (b) more skewed than any other real dataset I
have heard of.  I would be tempted to plot the rank vs number of articles
and look for the point where users seem to be above the Zipf curve.  I would
just eliminate those users from the data and see what happens.

LDA should definitely be on your horizon for testing with such sparse data
as Jake points out.

On Wed, Feb 24, 2010 at 10:17 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> > 470,640 users
> > 1,606,789 articles
> > 13,281,941 votes (0.00175% nonzero)
> > 43% of users voted on 3 or fewer articles (one vote per month)
> > 23% of users voted on more than 10 articles (87% of the data)
> > 0.05% of users voted on more than 100 articles (71% of votes)
> >
>
> Interesting data set - far more "items" than users, and *really*
> sparse.  SVD could definitely give you super crappy results for
> this data set, if my intuition is right.




-- 
Ted Dunning, CTO
DeepDyve

Re: Symmetric eigendecomposition for kernel PCA

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Feb 24, 2010 at 8:12 PM, Steven Buss <st...@gmail.com> wrote:

> > You've done SVD on the 400K x 1.6M input matrix already, and found
> > the results unsatisfactory?  What was it you were using the singular
> > vectors for, and how did you do the svd computation?
>
> We tried this approach first due to the stuff happening with the
> Netflix prize. I first tried using SVDLIBC, but it was taking an
> extremely long time and didn't give me any indication of how far it
> had come in computing the SVD, so I gave up on it and tried a gradient
> descent approach as described at
> http://sifter.org/~simon/journal/20061211.html by Simon Funk.


Ok, cool.  Brandyn Webb (aka Simon Funk) is actually the co-author of
the paper I used for the GHA code for SVD currently in Mahout as well.  It's
very fast for a non-parallel algorithm, can certainly scale pretty well to
the sizes you're talking about, and in fact should allow you to just plug in
your kernel also (I think he describes doing that in the article you
linked to).
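
For anyone who hasn't read that article: the heart of the gradient-descent
SVD is a tiny per-rating update, roughly like the sketch below (the array
names, learning rate, and regularization constant are illustrative, not
taken from Mahout or from Funk's actual code):

// Sketch of one SGD sweep training feature f, assuming features 0..f-1 have
// already been trained.  userF is numUsers x numFeatures, itemF is
// numItems x numFeatures; LRATE and REG are illustrative constants.
static final double LRATE = 0.001;
static final double REG = 0.015;

static void trainFeature(int f, int[] users, int[] items, double[] ratings,
                         double[][] userF, double[][] itemF) {
  for (int r = 0; r < ratings.length; r++) {
    int u = users[r], i = items[r];
    double predicted = 0.0;
    for (int k = 0; k <= f; k++) {
      predicted += userF[u][k] * itemF[i][k];
    }
    double err = ratings[r] - predicted;
    double uf = userF[u][f], vf = itemF[i][f];
    userF[u][f] += LRATE * (err * vf - REG * uf);
    itemF[i][f] += LRATE * (err * uf - REG * vf);
  }
}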


> I used
> the code at http://www.timelydevelopment.com/demos/NetflixPrize.aspx
> to find the first 64 or so singular values and vectors. I honestly
> cannot remember how I evaluated the results beyond looking at the
> recommendations and not getting what I expected. (This was nearly 16
> months ago, so maybe it's time to try SVD again using what I've
> learned since then).


It is pretty important to have a good metric on your results, because little
tweaks (how you normalize your input vectors, whether you mean-center
them, how you deal with the missing data, etc.) can have significant
effects on the results.


> One issue we have with the SVD is that it doesn't
> have any statistical meaning, we'd prefer to have results with
> probability and confidence.
>

Yeah, SVD can give pretty crappy results if you have a lot of
very small counts, and moving to something like Latent Dirichlet
Allocation (also available in Mahout!) might do better.


> 470,640 users
> 1,606,789 articles
> 13,281,941 votes (0.00175% nonzero)
> 43% of users voted on 3 or fewer articles (one vote per month)
> 23% of users voted on more than 10 articles (87% of the data)
> 0.05% of users voted on more than 100 articles (71% of votes)
>

Interesting data set - far more "items" than users, and *really*
sparse.  SVD could definitely give you super crappy results for
this data set, if my intuition is right.


> > H is the *distance* using your kernel function, or *similarity*?  If it's
> > sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is
> the
> > generalized inner product / kernel, right?
>
> Right, similarity, sorry. And yes, H(i,j) = k(r_i, r_j). We're
> currently using a jaccard kernel, but we haven't gotten far enough
> into trying out this method to try other kernel functions.
>

Ok cool.  For your data set, I bet Jaccard (Tanimoto) would work
pretty well.  Well, certainly better than euclidean or cosine.  But it
doesn't do anything for your super sparse situation.  I'm not sure
what would, off the top of my head.


> > What I mean is this: when computing the SVD of your original matrix
> > via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> > summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> > with GHA).  If your kernel can be represented as
> >
> >  k(r_i, r_j) = f(r_i).dot( f(r_j) )
> >
> > where f is some map from R^(A.numCols) to R^(kernelDimension)
> > Then you can just apply the kernel to the rows as you iterate over
> > them and compute the SVD of the kernelized form of A.
>
> If I understand this correctly, this is essentially what we're doing,
> just combined into the same step. Ultimately this approach will end up
> computing the eigenvectors/values for the H matrix, but it only takes
> the data matrix as input instead of the kernel matrix, right? It
> doesn't seem to me like we're saving any cycles here, but that's not
> to say that I'm opposed to the idea.
>

Yes, you are essentially computing the eigenvectors/values of H, so
the question is whether it does save cycles.  To jump ahead to
your questioning of my scaling analysis at the end:

Compute H:

  O(numUsers^2 * kernelCost(userSparsity))

where properly, the cost of applying the kernel is a function of
userSparsity = number of nonzero elements in a given row.

Then inevitably H will be a bit more dense than your original
matrix (but not terribly so - it seems that you may have a
significant proportion of the rows of (H - Identity) which have
*no* nonzero entries, which makes it very hard to do
recommendations for these users).

But regardless, if you then run Lanczos eigendecomposition on
H, it'll be nice and fast, scaling something like

O(k * numUsers * numSimilarUsersPerUser ).

If instead you computed kernelized Lanczos all in one fell
swoop, you'll be making k passes over the data, and computing
the kernel each time, yes, but the scaling is just that of Lanczos:

O(k * numUsers * kernelCost(userSparsity) )

(as you mention, I forgot the kernel cost here, yes)

This is tremendously better than computing the full kernel matrix,
by a factor of numUsers / k = 5000 or so.  Unless I'm missing
a factor somewhere.
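
(For scale: with numUsers around 470,000 and k on the order of a hundred
Lanczos iterations, numUsers / k does indeed land in the few-thousand range.)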



> I'm not intimately familiar with how the Lanczos algorithm works, but
> I am pretty sure that it will do several passes, meaning your
> suggestion will either repeatedly calculate k(r_i, r_j), or have to
> store the previous results. If it's the former, then it doesn't seem
> like a great approach since we're wasting cycles, and if it's the
>

Is the former, but the difference is between computing the k(r_i, r_j)
for all i and j, and computing k(r_i, V) for all i, for a set of only K =
100's of dense eigenvectors (well, Lanczos vectors, used to produce
eigenvectors).  Way way way less computation of the kernel.


> > The other problem with
> > this approach is that you really need to make sure you're doing
> > a hashing trick on the kernelized rows (a la stochastic
> > decomposition), or else you're probably going to run into problems
> > of size on your singular vector set (their dimensionality will be too
> > big).  Stochastic decomposition is really the thing that will make
> > this work much better (and I've got planned for the 0.4 release).
>
> At the risk of sounding woefully ignorant, I don't know what this
> means. I will read about it and hopefully have something meaningful to
> say tomorrow :)
>

Check out: http://arxiv.org/abs/0909.4061
It's a great review, and it allows really interesting kernels (even the
infinite dimensional ones) to be treated in the same way as I'm describing
above, without incurring the O(numUsers^2) cost of computing the full
kernel matrix.

It's probabilistic in the sense that it is not guaranteed to work, but the
bounds on the chance of failure are so good that you don't really need to
worry about it (or rather, you worry about it just enough to be sure you've
chosen your parameters such that it won't fail unless the same day you win
the lottery :) ).


> Like I mentioned earlier, we're currently trying out the jaccard
> kernel. My research partner is currently reading through Kernel
> Methods for Pattern Analysis, which will hopefully have some good
> suggestions. We plan to try some graph kernels as well (but there are
> no specifics yet).
>

I'd love to hear your thoughts on what kinds of kernels do well, because
we don't have much kernel work in Mahout yet, and knowing what
people like / find useful can help us know what to add to the libraries
here (plus I just find it interesting).


> It's an easy enough algorithm to MapReduce the calculation of H for
> most kernels. The output of this is about 50GB from a ~150MB input
> matrix. With thresholding, we can shrink this down a bit, and it might
> even be the case that we don't need to find H for all users. We might
> be able to pick out some subset of users. We're doing this now with a
> 10% random sample, and can run kPCA on a single machine.
>

Oh it's certainly parallelizable, but it's still O(numUsers^2). Well,
actually no, you're right.  If you properly MapReduce this matrix product
sparsely, you can quickly tell that at least for things like euclidean inner
products, cosine, and jaccard / tanimoto, you don't even need to do any work
for a significant chunk of the user-user pairs.  The question then becomes
whether, for any given user, you end up computing fewer than k =
reducedRank (the rank of your dimensional reduction) kernel inner products.
If so, then your MapReduce will indeed be more efficient than doing the
one-pass kernelized Lanczos, because the kernel computation is
typically very very fast for each user-user pair (probably safe to consider
it O(1) when nonzero), while for user-denseEigen pairs it'll scale as the
number of nonzero entries in k(r_i).

And you know, if you've got some clever MapReduce code to compute
the full sparse kernelized H computation,  we could certainly use it here
in Mahout!

  -jake

Re: Symmetric eigendecomposition for kernel PCA

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
Steve,
     Looks like your dataset is extremely sparse, given that only about 0.05% of users voted on more than 100 articles. I have had past experience with a similarly sparse dataset, though in my case there were no ratings, just (user, click) events. My team tried PLSI too at the time, but it didn't give good results, primarily due to the highly sparse nature of the data. We gathered statistics much like the ones you posted and then preprocessed the data by trimming clicks from the top and bottom X users, as they were suspected to be outliers.

An item co-occurrence matrix was then constructed by analyzing user sessions. The scores were time-decayed and slightly boosted to account for data-specific aspects. This gave us the similar items we were looking for, and our testing indicated a good amount of click-through. Note that we weren't after user-specific recommendations (that was the next phase), only "similar items".

Mahout has an implementation of co-occurrence-based recommendations that is still actively developed, but you should be able to try it out in its present shape too. At the moment it's a bit hard to plug in time decay and induce any data-specific bias, but it's good for a start.
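
A bare-bones version of that time-decayed co-occurrence counting might look
like the following sketch (the session representation and half-life constant
here are purely illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: accumulate a time-decayed item co-occurrence matrix from user
// sessions, where a session is a list of (itemId, timestamp) clicks.
public class CooccurrenceSketch {
  static final double HALF_LIFE_MS = 7L * 24 * 3600 * 1000;   // one week

  static class Click {
    long itemId;
    long timestampMs;
  }

  // cooc.get(a).get(b) holds the decayed co-occurrence score of items a and b.
  static void addSession(List<Click> session, long nowMs,
                         Map<Long, Map<Long, Double>> cooc) {
    for (int i = 0; i < session.size(); i++) {
      for (int j = i + 1; j < session.size(); j++) {
        Click a = session.get(i), b = session.get(j);
        if (a.itemId == b.itemId) {
          continue;
        }
        // decay by the age of the older click in the pair
        long ageMs = nowMs - Math.min(a.timestampMs, b.timestampMs);
        double weight = Math.pow(0.5, ageMs / HALF_LIFE_MS);
        bump(cooc, a.itemId, b.itemId, weight);
        bump(cooc, b.itemId, a.itemId, weight);
      }
    }
  }

  static void bump(Map<Long, Map<Long, Double>> cooc, long a, long b, double w) {
    Map<Long, Double> row = cooc.get(a);
    if (row == null) {
      row = new HashMap<Long, Double>();
      cooc.put(a, row);
    }
    Double old = row.get(b);
    row.put(b, (old == null ? 0.0 : old) + w);
  }
}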

Regards
-@nkur

On 2/25/10 9:42 AM, "Steven Buss" <st...@gmail.com> wrote:

> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory?  What was it you were using the singular
> vectors for, and how did you do the svd computation?

We tried this approach first due to the stuff happening with the
Netflix prize. I first tried using SVDLIBC, but it was taking an
extremely long time and didn't give me any indication of how far it
had come in computing the SVD, so I gave up on it and tried a gradient
descent approach as described at
http://sifter.org/~simon/journal/20061211.html by Simon Funk. I used
the code at http://www.timelydevelopment.com/demos/NetflixPrize.aspx
to find the first 64 or so singular values and vectors. I honestly
cannot remember how I evaluated the results beyond looking at the
recommendations and not getting what I expected. (This was nearly 16
months ago, so maybe it's time to try SVD again using what I've
learned since then). One issue we have with the SVD is that it doesn't
have any statistical meaning, we'd prefer to have results with
probability and confidence.

I'm not trying to argue against the SVD, indeed it works great in some
fields, but we have found that it doesn't produce very good results on
our data. We built our dataset by using the Digg API to fetch three
months of digg activity. Here's some details:

470,640 users
1,606,789 articles
13,281,941 votes (0.00175% nonzero)
43% of users voted on 3 or fewer articles (one vote per month)
23% of users voted on more than 10 articles (87% of the data)
0.05% of users voted on more than 100 articles (71% of votes)

I'd love to hear the experiences of people working with similar datasets.

> H is the *distance* using your kernel function, or *similarity*?  If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?

Right, similarity, sorry. And yes, H(i,j) = k(r_i, r_j). We're
currently using a jaccard kernel, but we haven't gotten far enough
into trying out this method to try other kernel functions.

> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> with GHA).  If your kernel can be represented as
>
>  k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.

If I understand this correctly, this is essentially what we're doing,
just combined into the same step. Ultimately this approach will end up
computing the eigenvectors/values for the H matrix, but it only takes
the data matrix as input instead of the kernel matrix, right? It
doesn't seem to me like we're saving any cycles here, but that's not
to say that I'm opposed to the idea.

I'm not intimately familiar with how the Lanczos algorithm works, but
I am pretty sure that it will do several passes, meaning your
suggestion will either repeatedly calculate k(r_i, r_j), or have to
store the previous results. If it's the former, then it doesn't seem
like a great approach since we're wasting cycles, and if it's the
latter then there's no real difference between passing in the kernel
matrix or finding it on the fly.

> The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big).  Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).

At the risk of sounding woefully ignorant, I don't know what this
means. I will read about it and hopefully have something meaningful to
say tomorrow :)

> What kinds of kernels are you planning to use?

Like I mentioned earlier, we're currently trying out the jaccard
kernel. My research partner is currently reading through Kernel
Methods for Pattern Analysis, which will hopefully have some good
suggestions. We plan to try some graph kernels as well (but there are
no specifics yet).

> How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H?  That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.

It's an easy enough algorithm to MapReduce the calculation of H for
most kernels. The output of this is about 50GB from a ~150MB input
matrix. With thresholding, we can shrink this down a bit, and it might
even be the case that we don't need to find H for all users. We might
be able to pick out some subset of users. We're doing this now with a
10% random sample, and can run kPCA on a single machine.

Isn't the cost of Lanczos you wrote ignoring the cost of evaluating
the kernel function?

Steven Buss
steven.buss@gmail.com
http://www.stevenbuss.com/



On Wed, Feb 24, 2010 at 5:30 PM, Jake Mannix <ja...@gmail.com> wrote:
> Hi Steven,
>
>  Glad to see you here in the land of more-than-140chars!
>
> On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <st...@gmail.com> wrote:
>>
>>
>> My research partner and I have a dataset that consists of 400,000
>> users and 1.6 million articles, with about 22 million nonzeros. We are
>> trying to use this data to make recommendations to users. We have
>> tried using the SVD and PLSI, both with unsatisfactory results, and
>> are now attempting kPCA.
>>
>
> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory?  What was it you were using the singular
> vectors for, and how did you do the svd computation?
>
>
>> We have a 400,000 by 400,000 sparse symmetric positive-definite
>> matrix, H, that we need the top couple hundred eigenvectors/values
>> for. Jake has told me that I can use mahout 180 unchanged, but it will
>> be doing redundant work and the output eigenvalues are the squares of
>> the ones we actually want. This sounds like a good approach, but it
>> would be great if mahout had an optimized eigendecomposition for
>> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
>> this, which I plan to do.
>>
>
> Depending on how much RAM you have on your box, "couple hundred"
> eigenvectors on 400k dimensions is at least "a couple" GB, and since
> Lanczos tends to get less eigenvectors than you expect (so you need
> to ask for more than you think you want), "a couple" could turn into
> maybe 10GB.  Doable, but hitting the boundaries of scalability.
>
> I'm in the process of filing another JIRA ticket concerning removing
> the need to keep the full Lanczos basis in RAM (because it's not really
> necessary, although it will be slower without the RAM available), to
> make this more scalable (the current implementation scales to "infinite"
> row size, but not infinite column size).
>
>> H is the pairwise distance in feature space (calculated using a kernel
>> function) between each pair of users (or some subset of users). After
>> I mentioned this to Jake, he asked me "why aren't you just doing it
>> all in one go? Kernelize on the rows, and do SVD on that? Why do the
>> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
>> asking, Jake, can you clarify?
>>
>
> H is the *distance* using your kernel function, or *similarity*?  If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?
>
> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> with GHA).  If your kernel can be represented as
>
>  k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.
>
> Of course, if you're not doing a kernel representable in that way,
> then you're not getting far down this road.  The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big).  Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).
>
> What kinds of kernels are you planning to use?  How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H?  That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.
>
>  -jake
>


Re: Symmetric eigendecomposition for kernel PCA

Posted by Steven Buss <st...@gmail.com>.
> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory?  What was it you were using the singular
> vectors for, and how did you do the svd computation?

We tried this approach first due to the stuff happening with the
Netflix prize. I first tried using SVDLIBC, but it was taking an
extremely long time and didn't give me any indication of how far it
had come in computing the SVD, so I gave up on it and tried a gradient
descent approach as described at
http://sifter.org/~simon/journal/20061211.html by Simon Funk. I used
the code at http://www.timelydevelopment.com/demos/NetflixPrize.aspx
to find the first 64 or so singular values and vectors. I honestly
cannot remember how I evaluated the results beyond looking at the
recommendations and not getting what I expected. (This was nearly 16
months ago, so maybe it's time to try SVD again using what I've
learned since then). One issue we have with the SVD is that it doesn't
have any statistical meaning, we'd prefer to have results with
probability and confidence.

I'm not trying to argue against the SVD, indeed it works great in some
fields, but we have found that it doesn't produce very good results on
our data. We built our dataset by using the Digg API to fetch three
months of digg activity. Here's some details:

470,640 users
1,606,789 articles
13,281,941 votes (0.00175% nonzero)
43% of users voted on 3 or fewer articles (one vote per month)
23% of users voted on more than 10 articles (87% of the data)
0.05% of users voted on more than 100 articles (71% of votes)

I'd love to hear the experiences of people working with similar datasets.

> H is the *distance* using your kernel function, or *similarity*?  If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?

Right, similarity, sorry. And yes, H(i,j) = k(r_i, r_j). We're
currently using a jaccard kernel, but we haven't gotten far enough
into trying out this method to try other kernel functions.

> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> with GHA).  If your kernel can be represented as
>
>  k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.

If I understand this correctly, this is essentially what we're doing,
just combined into the same step. Ultimately this approach will end up
computing the eigenvectors/values for the H matrix, but it only takes
the data matrix as input instead of the kernel matrix, right? It
doesn't seem to me like we're saving any cycles here, but that's not
to say that I'm opposed to the idea.

I'm not intimately familiar with how the Lanczos algorithm works, but
I am pretty sure that it will do several passes, meaning your
suggestion will either repeatedly calculate k(r_i, r_j), or have to
store the previous results. If it's the former, then it doesn't seem
like a great approach since we're wasting cycles, and if it's the
latter then there's no real difference between passing in the kernel
matrix or finding it on the fly.

> The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big).  Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).

At the risk of sounding woefully ignorant, I don't know what this
means. I will read about it and hopefully have something meaningful to
say tomorrow :)

> What kinds of kernels are you planning to use?

Like I mentioned earlier, we're currently trying out the jaccard
kernel. My research partner is currently reading through Kernel
Methods for Pattern Analysis, which will hopefully have some good
suggestions. We plan to try some graph kernels as well (but there are
no specifics yet).

> How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H?  That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.

It's an easy enough algorithm to MapReduce the calculation of H for
most kernels. The output of this is about 50GB from a ~150MB input
matrix. With thresholding, we can shrink this down a bit, and it might
even be the case that we don't need to find H for all users. We might
be able to pick out some subset of users. We're doing this now with a
10% random sample, and can run kPCA on a single machine.

Isn't the cost of Lanczos you wrote ignoring the cost of evaluating
the kernel function?

Steven Buss
steven.buss@gmail.com
http://www.stevenbuss.com/



On Wed, Feb 24, 2010 at 5:30 PM, Jake Mannix <ja...@gmail.com> wrote:
> Hi Steven,
>
>  Glad to see you here in the land of more-than-140chars!
>
> On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <st...@gmail.com> wrote:
>>
>>
>> My research partner and I have a dataset that consists of 400,000
>> users and 1.6 million articles, with about 22 million nonzeros. We are
>> trying to use this data to make recommendations to users. We have
>> tried using the SVD and PLSI, both with unsatisfactory results, and
>> are now attempting kPCA.
>>
>
> You've done SVD on the 400K x 1.6M input matrix already, and found
> the results unsatisfactory?  What was it you were using the singular
> vectors for, and how did you do the svd computation?
>
>
>> We have a 400,000 by 400,000 sparse symmetric positive-definite
>> matrix, H, that we need the top couple hundred eigenvectors/values
>> for. Jake has told me that I can use mahout 180 unchanged, but it will
>> be doing redundant work and the output eigenvalues are the squares of
>> the ones we actually want. This sounds like a good approach, but it
>> would be great if mahout had an optimized eigendecomposition for
>> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
>> this, which I plan to do.
>>
>
> Depending on how much RAM you have on your box, "couple hundred"
> eigenvectors on 400k dimensions is at least "a couple" GB, and since
> Lanczos tends to get less eigenvectors than you expect (so you need
> to ask for more than you think you want), "a couple" could turn into
> maybe 10GB.  Doable, but hitting the boundaries of scalability.
>
> I'm in the process of filing another JIRA ticket concerning removing
> the need to keep the full Lanczos basis in RAM (because it's not really
> necessary, although it will be slower without the RAM available), to
> make this more scalable (the current implementation scales to "infinite"
> row size, but not infinite column size).
>
>> H is the pairwise distance in feature space (calculated using a kernel
>> function) between each pair of users (or some subset of users). After
>> I mentioned this to Jake, he asked me "why aren't you just doing it
>> all in one go? Kernelize on the rows, and do SVD on that? Why do the
>> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
>> asking, Jake, can you clarify?
>>
>
> H is the *distance* using your kernel function, or *similarity*?  If it's
> sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
> generalized inner product / kernel, right?
>
> What I mean is this: when computing the SVD of your original matrix
> via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
> summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
> with GHA).  If your kernel can be represented as
>
>  k(r_i, r_j) = f(r_i).dot( f(r_j) )
>
> where f is some map from R^(A.numCols) to R^(kernelDimension)
> Then you can just apply the kernel to the rows as you iterate over
> them and compute the SVD of the kernelized form of A.
>
> Of course, if you're not doing a kernel representable in that way,
> then you're not getting far down this road.  The other problem with
> this approach is that you really need to make sure you're doing
> a hashing trick on the kernelized rows (a la stochastic
> decomposition), or else you're probably going to run into problems
> of size on your singular vector set (their dimensionality will be too
> big).  Stochastic decomposition is really the thing that will make
> this work much better (and I've got planned for the 0.4 release).
>
> What kinds of kernels are you planning to use?  How do you
> plan on dealing with the userSparsity * numUsers^2 * kernelCost
> computational cost of producing H?  That's even higher than the
> desiredRank * numUsers * userSparsity cost of Lanczos.
>
>  -jake
>

Re: Symmetric eigendecomposition for kernel PCA

Posted by Jake Mannix <ja...@gmail.com>.
Hi Steven,

  Glad to see you here in the land of more-than-140chars!

On Wed, Feb 24, 2010 at 1:53 PM, Steven Buss <st...@gmail.com> wrote:
>
>
> My research partner and I have a dataset that consists of 400,000
> users and 1.6 million articles, with about 22 million nonzeros. We are
> trying to use this data to make recommendations to users. We have
> tried using the SVD and PLSI, both with unsatisfactory results, and
> are now attempting kPCA.
>

You've done SVD on the 400K x 1.6M input matrix already, and found
the results unsatisfactory?  What was it you were using the singular
vectors for, and how did you do the svd computation?


> We have a 400,000 by 400,000 sparse symmetric positive-definite
> matrix, H, that we need the top couple hundred eigenvectors/values
> for. Jake has told me that I can use mahout 180 unchanged, but it will
> be doing redundant work and the output eigenvalues are the squares of
> the ones we actually want. This sounds like a good approach, but it
> would be great if mahout had an optimized eigendecomposition for
> symmetric matrices. Jake suggested I submit a JIRA ticket regarding
> this, which I plan to do.
>

Depending on how much RAM you have on your box, "couple hundred"
eigenvectors on 400k dimensions is at least "a couple" GB, and since
Lanczos tends to get less eigenvectors than you expect (so you need
to ask for more than you think you want), "a couple" could turn into
maybe 10GB.  Doable, but hitting the boundaries of scalability.
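
(Back-of-the-envelope: one dense vector over 400,000 dimensions is
400,000 * 8 bytes ~= 3.2 MB, so a couple hundred eigenvectors is well under
a GB on their own; it's keeping several times that many Lanczos basis
vectors around in order to converge that pushes it into the multi-GB range.)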

I'm in the process of filing another JIRA ticket concerning removing
the need to keep the full Lanczos basis in RAM (because it's not really
necessary, although it will be slower without the RAM available), to
make this more scalable (the current implementation scales to "infinite"
row size, but not infinite column size).

> H is the pairwise distance in feature space (calculated using a kernel
> function) between each pair of users (or some subset of users). After
> I mentioned this to Jake, he asked me "why aren't you just doing it
> all in one go? Kernelize on the rows, and do SVD on that? Why do the
> M*M^t intermediate step?" Unfortunately, I'm not sure what you're
> asking, Jake, can you clarify?
>

H is the *distance* using your kernel function, or *similarity*?  If it's
sparse, I imagine you mean it's really H(i,j) = k(r_i, r_j) where k is the
generalized inner product / kernel, right?

What I mean is this: when computing the SVD of your original matrix
via Lanczos, you compute (A*A^t) * v by taking the rows r_i and
summing up all of the: ( v.dot(r_i) ) * r_i  (a similar thing happens
with GHA).  If your kernel can be represented as

  k(r_i, r_j) = f(r_i).dot( f(r_j) )

where f is some map from R^(A.numCols) to R^(kernelDimension)
Then you can just apply the kernel to the rows as you iterate over
them and compute the SVD of the kernelized form of A.

Of course, if you're not doing a kernel representable in that way,
then you're not getting far down this road.  The other problem with
this approach is that you really need to make sure you're doing
a hashing trick on the kernelized rows (a la stochastic
decomposition), or else you're probably going to run into problems
of size on your singular vector set (their dimensionality will be too
big).  Stochastic decomposition is really the thing that will make
this work much better (and I've got planned for the 0.4 release).

What kinds of kernels are you planning to use?  How do you
plan on dealing with the userSparsity * numUsers^2 * kernelCost
computational cost of producing H?  That's even higher than the
desiredRank * numUsers * userSparsity cost of Lanczos.

  -jake