Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/09/06 19:17:46 UTC

Doing dimensionality reduction with SSVD and Lanczos

When using Lanczos the recommendation is to use the clean eigenvectors as a distributed row matrix--call it V.

A-hat = A^t V^t, per the clusterdump tests DSVD and DSVD2.

Dmitriy and Ted recommend when using SSVD to do:

A-hat = US 

When using PCA it's also preferable to use --uHalfSigma to create U with the SSVD solver. One difficulty is that to perform the multiplication you have to turn the singular values vector (diagonal values) into a distributed row matrix or write your own multiply function, correct?

Questions:
For SSVD, can someone explain why US is preferred? Given A = USV^t, how can you ignore the effect of V^t? Is this only for PCA? In other words, if you did not use PCA weighting, would you ignore V^t?
For Lanczos, A-hat = A^t V^t seems to strip doc ids during the transpose, or am I mistaken? Also, shouldn't A-hat be transposed before performing kmeans or other analysis?



> Dmitriy said
With SSVD you need just US (or U*Sigma in other notation).
This is the dimensionally reduced output of the original document
matrix you ran with the --pca option.

As Ted suggests, you may also use US^0.5, which is already produced by
providing --uHalfSigma (or its embedded setter analog). The keys of
that output (produced by the getUPath() call) will already contain your
Text document ids as sequence file keys.

-d
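Since Sigma is diagonal, U*Sigma never needs a distributed multiply: it is just a per-column scaling of U by the singular values. A minimal plain-Java sketch of that idea (bare arrays and made-up values, not the Mahout API):

```java
// Scale each row of U by the singular values sv on the fly:
// (U * Sigma)[i][j] = U[i][j] * sv[j], because Sigma is diagonal.
public class UScale {
    public static double[][] timesSigma(double[][] u, double[] sv) {
        double[][] us = new double[u.length][sv.length];
        for (int i = 0; i < u.length; i++) {
            for (int j = 0; j < sv.length; j++) {
                us[i][j] = u[i][j] * sv[j];
            }
        }
        return us;
    }

    public static void main(String[] args) {
        double[][] u = {{1, 2}, {3, 4}};   // made-up U
        double[] sv = {10, 0.5};           // made-up singular values
        double[][] us = timesSigma(u, sv);
        // row 0 becomes {10, 1}, row 1 becomes {30, 2}
        System.out.println(us[0][0] + " " + us[0][1] + " " + us[1][0] + " " + us[1][1]);
    }
}
```

The same scaling done row by row as U is streamed is exactly what the UJob mapper change discussed later in the thread amounts to.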



Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
yeah, I realized we are talking about different things. 

I need the ID, not the key. The ID is part of the VectorWritable and can be an int or text, text in my case. I'm looking at the jobs you mention, maybe there is a clue there...

On Sep 7, 2012, at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

just checked again. yes U inherits keys from sequence file of A. (are
we talking about keys of the sequence files or names of a NamedVector?
NamedVector is not supported, keys of sequence files are.)

On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
> 
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>> 
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>> 
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>> 
>>> 
>>> For
>>>    BtJob.run(conf,
>>>                inputPath,
>>>                qPath,
>>>                pcaMeanPath,
>>>                btPath,
>>>                minSplitSize,
>>>                k,
>>>                p,
>>>                outerBlockHeight,
>>>                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                broadcast,
>>>                labelType,
>>>                q <= 0);
>>> 
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>> 
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>> 
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>> 
>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>> 
>> Yes, that's the idea.
>> 
>> B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is solved by another job.
>> 
>>> 
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> 
>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>> 
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>> 
>>>> WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance
>>>> proportions for doing clustering and similarity (though I'm not sure this
>>>> is strictly required with cosine distance), so I probably want to use
>>>> U * Sigma instead of sqrt Sigma.
>>>> 
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>> 
>>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) {
>>>>       // set from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>       sValues = SSVDHelper.loadVector(sigmaPath, context.getConfiguration());
>>>>       // sValues.assign(Functions.SQRT); // no sqrt needed for full-Sigma weighting
>>>>     }
>>>> 
>>>> sValues is already applied to U in the map, which would remain unchanged,
>>>> since the Sigma-weighted (instead of sqrt-Sigma) values will already be in
>>>> sValues.
>>>> 
>>>>     if (sValues != null) {
>>>>       for (int i = 0; i < k; i++) {
>>>>         uRow.setQuick(i, qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
>>>>       }
>>>>     } else {
>>>>       …
>>>> 
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>> 
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>> 
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>> 
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely zero
>>>> reason to encode it as a distributed row matrix.
>>>> 
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>> 
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>> 
>>>> It is much easier to have SSVD to output U*Sigma directly using the
>>>> same logic as above (requires a patch) or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>> 
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>> 
>>>> 
>>>> 
>>> 
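Dmitriy's variance point above can be checked with a toy example: two reduced-space points that look equidistant in plain U coordinates separate once each axis is rescaled by its (hypothetical) singular value, restoring the directional proportions of the original input. A plain-Java sketch:

```java
// Two reduced-space points equidistant from the origin in plain U
// coordinates stop being equidistant once each axis is rescaled by its
// singular value: the scaling restores the original variance proportions.
public class SigmaDistances {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] origin = {0, 0};
        double[] p = {1, 0};   // along the first singular direction
        double[] q = {0, 1};   // along the second
        double[] sv = {10, 1}; // hypothetical singular values

        // In U coordinates both points look equally far from the origin.
        System.out.println(dist(p, origin) + " vs " + dist(q, origin)); // 1.0 vs 1.0

        // After scaling by Sigma, the first direction dominates again.
        double[] ps = {p[0] * sv[0], p[1] * sv[1]};
        double[] qs = {q[0] * sv[0], q[1] * sv[1]};
        System.out.println(dist(ps, origin) + " vs " + dist(qs, origin)); // 10.0 vs 1.0
    }
}
```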


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Sep 10, 2012 at 11:20 AM, Pat Ferrel <pa...@gmail.com> wrote:
> If my logic is basically sound then the question remains: How do we get vectors through DR  and/or LSA/PCA and back into term space?

These vectors will be dense and have the size of the dictionary. Even
if there are few of them, at half a million in length that's probably
overkill. I don't see why you'd want to project them back. (There might
be some interest in knowing the distance between a centroid and a
point, or in doing something like multidimensional scaling to visualize
clusters to users, but you could do that just fine in the reduced space.)

Fold-out (inverse transformation) is not directly supported and is
outside the scope of this PCA exercise, but like i said it is a simple
matter of V*c, where c is the centroid column vector. If you'd like to
do it with Mahout, i think there's a DRM operation for matrix-vector
multiplication (or even matrix-matrix multiplication in the case of
multiple centroids). However, something tells me an in-memory operation
will be both feasible and more desirable.
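As a concrete illustration of the in-memory fold-out, here is a plain-Java sketch of the product against V, with a hypothetical 4-term dictionary and k = 2; note the resulting term vector is fully dense:

```java
// Back-project a k-dimensional centroid c into term space as V * c,
// where V is the (numTerms x k) matrix of right singular vectors.
public class FoldOut {
    public static double[] foldOut(double[][] v, double[] c) {
        double[] term = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            for (int j = 0; j < c.length; j++) {
                term[i] += v[i][j] * c[j];
            }
        }
        return term;
    }

    public static void main(String[] args) {
        double[][] v = {{1, 0}, {0, 1}, {0.5, 0.5}, {0, 0}}; // hypothetical 4 x 2 V
        double[] c = {2, 4};                                  // reduced-space centroid
        double[] t = foldOut(v, c);
        System.out.println(t[0] + " " + t[1] + " " + t[2] + " " + t[3]); // 2.0 4.0 3.0 0.0
    }
}
```

For a handful of centroids this is trivial in memory; it is only when folding out many vectors that density becomes a problem.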

>
>
> On Sep 10, 2012, at 9:12 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> It is in the doc, but in short, the gist of --pca options is as follows:
>
> Normally, PCA conversion is projection onto (offset) orthogonal basis
> subspace which retains most of variances in the original data. It so
> happens that such basis corresponds to right singular vectors of (A-M)
> hence the projection is formed by (A-M)*V, which is approximated by
> U*Sigma of reduced rank svd of (A-M). (Assuming row vector data
> points). This is the scope of our PCA task here.
>
> Now, consider a typical document corpus of say 1 million documents.
> Typically (depending on stemming) total dictionary size would be
> anywhere between 10^5 to 10^6, but let's assume 10^5 for our purposes.
> However, an individual document will perhaps have 100 terms on average.
> Thus the PCA input for such corpus will be a 1M x 100k sparse matrix
> but the number of non-zero elements in it will be just 1M x 100 ==
> 100M. This is a fairly small problem for an SSVD to solve.
>
> However, we want to run SVD on (A-M) where each row of M is the column
> mean of A. The column mean is a dense vector, hence M is a dense matrix,
> and so is (A-M).
> Then in our hypothetical case we observe that (A-M) takes 1M x 100k
> non-zero elements, which means it takes 1000 times more space than A,
> and its ssvd will take, asymptotically speaking, (1000)^(3/2) more
> flops than A, that is almost 5 orders of magnitude of additional svd
> computational cost.
>
> However, SSVD algorithm may be amended so that it doesn't have to ever
> compute svd(A-M). I will not go into detail, but in essence the bottom
> line is that it reduces computational expense to that being equivalent
> to svd(A), i.e. almost 5 orders of magnitude in our hypothetical
> corpus PCA case.
>
> Hope it makes things clearer.
>
> On Mon, Sep 10, 2012 at 7:25 AM, Pat Ferrel <pa...@gmail.com> wrote:
>> Another issue with the SSVD job options is that if you want to use SSVD but keep the input in the original dimension-space (term-space in many cases) you would do the following
>> * create input matrix A based on input dimensions (terms)
>> * calculate the full transform, which retains the output in "term-space" AHat = U*Sigma*V^t
>> * at the end AHat should be in term-space but transformed by DR and Sigma weighted for PCA, right?
>> * then AHat can be substituted for A where analysis or examination needs the original dimension definitions (terms).
>>
>> The problem with the options is that setting either --uHalfSigma or --vHalfSigma sets sValues to the sqrt of the singular values, and since UJob and VJob only check whether sValues exists before applying it, the weighting is applied to BOTH U and V. In other words, setting either --uHalfSigma or --vHalfSigma applies sqrt Sigma to both U and V.
>>
>> To do U*Sigma*V^t  the SSVD code would have to be changed or the U * Sigma would have to be calculated outside SSVD (an ugly alternative).
>>
>> But please correct me where I'm wrong.
>>
>>
>> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>> I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.
>>
>> I need to figure out why the row ids/Keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S so why not clustering? In other words with rowsimilarity I can map the results back to the original input rows (documents in my case).
>>
>> As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.
>>
>> Another thing to consider, and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of the original term-space. Weights based on the new basis are not really human-understandable. But setting me straight here may be another conversation.
>>
>> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> I can do a patch to propagate names of named vectors from A to U too
>> if that's a requirement for what you do. But we need to make sure it
>> solves your problem. i am still not sure what are IDs in your
>> definition and what is required for k-means.
>>
>> Thinking of that, it's probably a worthy patch anyway. I'll write
>> something up along with API changes for A*Sigma outputs. I think since
>> there are so many output options, they should be redesigned not to be
>> mutually exclusive.
>>
>> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
>>>
>>> I'm not getting any id at all in the cluster data, not even a key for a row.
>>>
>>> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
>>>
>>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> yeah seq2sparse seems to have -nv option now to churn out named
>>> vectors too. It doesn't seem to be listed in the MIA book though.
>>>
>>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> Sequence file keys, on the other hand, are
>>>>> what is populated by seq2sparse output, so they are useful for mapping
>>>>> results to original documents.
>>>>
>>>> Although honestly i am not so sure about seq2sparse anymore. There has
>>>> been some time since i looked at this for the last time.
>>>
>>
>>
>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you.

(as usual, it kills me how little i know) :)

On Mon, Sep 10, 2012 at 2:14 PM, Ted Dunning <te...@gmail.com> wrote:
> You are correct-ish.
>
> Left and right inverses are still inverses, just not commutative.
>
> Moreover, since V is full rank and orthonormal, the SVD is trivially:
>
>    svd(V) = V I I
>
> This means that while V' is the left inverse, it is also the right Penrose
> inverse in the sense that the least squares solution of argmin_X norm(V X -
> I) is V'.
>
> On Mon, Sep 10, 2012 at 2:05 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Yep that's my understanding too. V cannot have inverse since it is not
>> square (in common case).
>>
>> But since it is orthonormal, transpose produces a pseudoinverse which
>> we can use just the same.
>>
>> On Mon, Sep 10, 2012 at 2:03 PM, Sean Owen <sr...@gmail.com> wrote:
>> > Yes, doesn't the V V' != I bit mean it's not an inverse? I thought
>> > only square matrices had a real inverse. This is a one-sided inverse,
>> > which is (IIRC?) slightly stronger than being a pseudo-inverse.
>> >
>> > On Mon, Sep 10, 2012 at 9:59 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> >> No.  It is the inverse.  V' V = I
>> >>
>> >> On the other hand, V V' != I.  We do know that norm(A V V' - A) is
>> small.
>>

Re: SSVD compute U * Sigma

Posted by Ted Dunning <te...@gmail.com>.
You are correct-ish.

Left and right inverses are still inverses, just not commutative.

Moreover, since V is full rank and orthonormal, the SVD is trivially:

   svd(V) = V I I

This means that while V' is the left inverse, it is also the right Penrose
inverse in the sense that the least squares solution of argmin_X norm(V X -
I) is V'.

On Mon, Sep 10, 2012 at 2:05 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yep that's my understanding too. V cannot have inverse since it is not
> square (in common case).
>
> But since it is orthonormal, transpose produces a pseudoinverse which
> we can use just the same.
>
> On Mon, Sep 10, 2012 at 2:03 PM, Sean Owen <sr...@gmail.com> wrote:
> > Yes, doesn't the V V' != I bit mean it's not an inverse? I thought
> > only square matrices had a real inverse. This is a one-sided inverse,
> > which is (IIRC?) slightly stronger than being a pseudo-inverse.
> >
> > On Mon, Sep 10, 2012 at 9:59 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >> No.  It is the inverse.  V' V = I
> >>
> >> On the other hand, V V' != I.  We do know that norm(A V V' - A) is
> small.
>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yep that's my understanding too. V cannot have inverse since it is not
square (in common case).

But since it is orthonormal, transpose produces a pseudoinverse which
we can use just the same.

On Mon, Sep 10, 2012 at 2:03 PM, Sean Owen <sr...@gmail.com> wrote:
> Yes, doesn't the V V' != I bit mean it's not an inverse? I thought
> only square matrices had a real inverse. This is a one-sided inverse,
> which is (IIRC?) slightly stronger than being a pseudo-inverse.
>
> On Mon, Sep 10, 2012 at 9:59 PM, Ted Dunning <te...@gmail.com> wrote:
>> No.  It is the inverse.  V' V = I
>>
>> On the other hand, V V' != I.  We do know that norm(A V V' - A) is small.

Re: SSVD compute U * Sigma

Posted by Sean Owen <sr...@gmail.com>.
Yes, doesn't the V V' != I bit mean it's not an inverse? I thought
only square matrices had a real inverse. This is a one-sided inverse,
which is (IIRC?) slightly stronger than being a pseudo-inverse.

On Mon, Sep 10, 2012 at 9:59 PM, Ted Dunning <te...@gmail.com> wrote:
> No.  It is the inverse.  V' V = I
>
> On the other hand, V V' != I.  We do know that norm(A V V' - A) is small.

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Sep 10, 2012 at 2:03 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Mon, Sep 10, 2012 at 1:59 PM, Ted Dunning <te...@gmail.com> wrote:
>> No.  It is the inverse.  V' V = I
>
> Pseudoinverse has the same property afaik.

PS if V is orthonormal which it is

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Sep 10, 2012 at 1:59 PM, Ted Dunning <te...@gmail.com> wrote:
> No.  It is the inverse.  V' V = I

Pseudoinverse has the same property afaik. The difference as i have it
from school is  that strictly speaking regular inverse is defined for
square matrices only (nxn) and V is not square... But i guess it is a
matter of interpretation..

>
> On the other hand, V V' != I.  We do know that norm(A V V' - A) is small.
>
> On Mon, Sep 10, 2012 at 1:57 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Actually i think it is Penrose pseudo-inverse but for our purposes i
>> think it has the same effect as actual inverses...
>>
>> On Mon, Sep 10, 2012 at 1:35 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> >
>> >> The inverse of V is trivial.  It is orthonormal so the inverse is the
>> >> transpose.
>> >>
>> >
>> > D'oh, that brain cell has been dormant since Linear Algebra 101, which
>> is good. I thought it had died.
>>

Re: SSVD compute U * Sigma

Posted by Ted Dunning <te...@gmail.com>.
No.  It is the inverse.  V' V = I

On the other hand, V V' != I.  We do know that norm(A V V' - A) is small.
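These identities are easy to confirm numerically on a small, tall, column-orthonormal V. A plain-Java sketch with a hand-built 3 x 2 example:

```java
// For a tall column-orthonormal V, V'V is the k x k identity, while VV'
// is an n x n projection that is generally not the identity.
public class OneSidedInverse {
    static double[][] mult(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++)
            for (int k = 0; k < b.length; k++)
                for (int j = 0; j < b[0].length; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    public static void main(String[] args) {
        double[][] v = {{1, 0}, {0, 1}, {0, 0}};  // 3 x 2, orthonormal columns
        double[][] vtv = mult(transpose(v), v);   // 2 x 2 identity
        double[][] vvt = mult(v, transpose(v));   // 3 x 3 projection, not identity
        System.out.println(vtv[0][0] + " " + vtv[0][1] + " " + vtv[1][1]); // 1.0 0.0 1.0
        System.out.println(vvt[2][2]); // 0.0 -- a zero on the diagonal, so VV' != I
    }
}
```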

On Mon, Sep 10, 2012 at 1:57 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Actually i think it is Penrose pseudo-inverse but for our purposes i
> think it has the same effect as actual inverses...
>
> On Mon, Sep 10, 2012 at 1:35 PM, Pat Ferrel <pa...@gmail.com> wrote:
> >
> >> The inverse of V is trivial.  It is orthonormal so the inverse is the
> >> transpose.
> >>
> >
> > D'oh, that brain cell has been dormant since Linear Algebra 101, which
> is good. I thought it had died.
>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Actually i think it is Penrose pseudo-inverse but for our purposes i
think it has the same effect as actual inverses...

On Mon, Sep 10, 2012 at 1:35 PM, Pat Ferrel <pa...@gmail.com> wrote:
>
>> The inverse of V is trivial.  It is orthonormal so the inverse is the
>> transpose.
>>
>
> D'oh, that brain cell has been dormant since Linear Algebra 101, which is good. I thought it had died.

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
> The inverse of V is trivial.  It is orthonormal so the inverse is the
> transpose.
> 

D'oh, that brain cell has been dormant since Linear Algebra 101, which is good. I thought it had died.

Re: SSVD compute U * Sigma

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Sep 10, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:

> Got it.
>
> Maybe it would be enough to do the projection into PCA space and back
> project only a very very few individual vectors (cluster centroids and
> special docs).


Yes.   On-demand fold-out works well.


> As to density we can also do with only the top few terms so we could
> drastically compress the dense vector. Still that would require calculating
> the inverse of V I think.
>

The inverse of V is trivial.  It is orthonormal so the inverse is the
transpose.

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
Got it. 

Maybe it would be enough to do the projection into PCA space and back project only a very few individual vectors (cluster centroids and special docs). As to density, we can also make do with only the top few terms, so we could drastically compress the dense vector. Still, that would require calculating the inverse of V, I think.

For now we'll use the original input vectors when inspecting terms.
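Compressing a dense back-projected vector down to its top few terms, as suggested above, can be as simple as keeping the k largest-magnitude entries. A plain-Java sketch (hypothetical values, not a Mahout API):

```java
import java.util.Arrays;
import java.util.Comparator;

// Keep only the k largest-magnitude entries of a dense term vector --
// a crude way to compress a back-projected centroid before showing
// its top terms in a UI.
public class TopTerms {
    public static int[] topK(double[] termVec, int k) {
        Integer[] idx = new Integer[termVec.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort term indices by descending absolute weight.
        Arrays.sort(idx, Comparator.comparingDouble(i -> -Math.abs(termVec[i])));
        int[] top = new int[k];
        for (int i = 0; i < k; i++) top[i] = idx[i];
        return top;
    }

    public static void main(String[] args) {
        double[] folded = {0.1, -3.0, 0.7, 2.5, 0.0}; // made-up back-projected weights
        int[] top = topK(folded, 2);
        System.out.println(top[0] + " " + top[1]); // 1 3 -- indices of the largest terms
    }
}
```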

On Sep 10, 2012, at 11:37 AM, Ted Dunning <te...@gmail.com> wrote:

Danger Will Robinson.

This will make your vectors dense.  That can be disastrous.

Even if you don't try to *store* these dense vectors, merely moving them
around memory can be prohibitive.

On Mon, Sep 10, 2012 at 11:20 AM, Pat Ferrel <pa...@gmail.com> wrote:

>      • Project the vectors back into the original term space.  Here the
> assumption is that the back projected vectors will be "cleaner" in some
> sense.


Re: SSVD compute U * Sigma

Posted by Ted Dunning <te...@gmail.com>.
Danger Will Robinson.

This will make your vectors dense.  That can be disastrous.

Even if you don't try to *store* these dense vectors, merely moving them
around memory can be prohibitive.

On Mon, Sep 10, 2012 at 11:20 AM, Pat Ferrel <pa...@gmail.com> wrote:

>       • Project the vectors back into the original term space.  Here the
> assumption is that the back projected vectors will be "cleaner" in some
> sense.

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
Thanks, this is very helpful. Your explanation of how SSVD works and the dense matrices it produces may explain why calculating U * Sigma * V^t is not practical. I think you are saying that it is not practical, not that the logic/math is wrong? But anyway here is my reasoning and maybe there is another way to skin the cat…

First, I don't require LSA/PCA; it just seems like a useful technique given the application below. I do require the benefits of DR, on the assumption that it can be used to reduce the effect of input noise. My assumption is that PCA will add the benefit of amplifying the important factors in the input, though this must be verified with use.

Goals:
	• Cluster and calculate vector similarity on input data taken from web pages. This works now but noise in the term space is often pretty high even after applying term scrubbing techniques like stemming, TFIDF, use of the smart parser Boilerpipe, etc.
	• Since the data is somewhat noisy, project the A vectors into LSA/PCA space, performing DR at the same time. Then run the projected vectors through clustering and vector similarity. This can be done by running analysis on U * Sigma (if I can figure out how to get clustering to recognize the input format of U * Sigma; this still blocks me).
	• Project the vectors back into the original term space.  Here the assumption is that the back projected vectors will be "cleaner" in some sense.
	• Drop the back projected vectors into an existing term-based UI, and enable other term-based analysis like vector-based Solr queries.

I can give several examples of how the term-space representation of vectors is used. Suffice it to say they all work fine with a non-LSA analysis pipeline. Basically now I'm trying to use some DR and/or LSA/PCA techniques to improve results but still need to look at results in term space. If this can't be done maybe I have to rely on projected vectors for some analysis and use the original vectors for other analysis or UI.

If my logic is basically sound then the question remains: How do we get vectors through DR  and/or LSA/PCA and back into term space? 


On Sep 10, 2012, at 9:12 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

It is in the doc, but in short, the gist of --pca options is as follows:

Normally, PCA conversion is projection onto (offset) orthogonal basis
subspace which retains most of variances in the original data. It so
happens that such basis corresponds to right singular vectors of (A-M)
hence the projection is formed by (A-M)*V, which is approximated by
U*Sigma of reduced rank svd of (A-M). (Assuming row vector data
points). This is the scope of our PCA task here.

Now, consider a typical document corpus of say 1 million documents.
Typically (depending on stemming) total dictionary size would be
anywhere between 10^5 to 10^6, but let's assume 10^5 for our purposes.
However, an individual document will perhaps have 100 terms on average.
Thus the PCA input for such corpus will be a 1M x 100k sparse matrix
but the number of non-zero elements in it will be just 1M x 100 ==
100M. This is a fairly small problem for an SSVD to solve.

However, we want to run SVD on (A-M) where each row of M is the column
mean of A. The column mean is a dense vector, hence M is a dense matrix,
and so is (A-M).
Then in our hypothetical case we observe that (A-M) takes 1M x 100k
non-zero elements, which means it takes 1000 times more space than A,
and its ssvd will take, asymptotically speaking, (1000)^(3/2) more
flops than A, that is almost 5 orders of magnitude of additional svd
computational cost.

However, SSVD algorithm may be amended so that it doesn't have to ever
compute svd(A-M). I will not go into detail, but in essence the bottom
line is that it reduces computational expense to that being equivalent
to svd(A), i.e. almost 5 orders of magnitude in our hypothetical
corpus PCA case.

Hope it makes things clearer.

On Mon, Sep 10, 2012 at 7:25 AM, Pat Ferrel <pa...@gmail.com> wrote:
> Another issue with the SSVD job options is that if you want to use SSVD but keep the input in the original dimension-space (term-space in many cases) you would do the following
> * create input matrix A based on input dimensions (terms)
> * calculate the full transform, which retains the output in "term-space" AHat = U*Sigma*V^t
> * at the end AHat should be in term-space but transformed by DR and Sigma weighted for PCA, right?
> * then AHat can be substituted for A where analysis or examination needs the original dimension definitions (terms).
> 
> The problem with the options is that setting either --uHalfSigma or --vHalfSigma sets sValues to the sqrt of the singular values, and since UJob and VJob only check whether sValues exists before applying it, the weighting is applied to BOTH U and V. In other words, setting either --uHalfSigma or --vHalfSigma applies sqrt Sigma to both U and V.
> 
> To do U*Sigma*V^t  the SSVD code would have to be changed or the U * Sigma would have to be calculated outside SSVD (an ugly alternative).
> 
> But please correct me where I'm wrong.
> 
> 
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:
> 
> I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.
> 
> I need to figure out why the row ids/Keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S so why not clustering? In other words with rowsimilarity I can map the results back to the original input rows (documents in my case).
> 
> As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.
> 
> Another thing to consider, and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of the original term-space. Weights based on the new basis are not really human-understandable. But setting me straight here may be another conversation.
> 
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. i am still not sure what are IDs in your
> definition and what is required for k-means.
> 
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
> 
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
>> 
>> I'm not getting any id at all in the cluster data, not even a key for a row.
>> 
>> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
>> 
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> yeah seq2sparse seems to have -nv option now to churn out named
>> vectors too. It doesn't seem to be listed in the MIA book though.
>> 
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> Sequence file keys, on the other hand, are
>>>> what is populated by seq2sparse output, so they are useful for mapping
>>>> results to original documents.
>>> 
>>> Although honestly i am not so sure about seq2sparse anymore. There has
>>> been some time since i looked at this for the last time.
>> 
> 
> 


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It is in the doc, but in short, the gist of --pca options is as follows:

Normally, PCA conversion is projection onto (offset) orthogonal basis
subspace which retains most of variances in the original data. It so
happens that such basis corresponds to right singular vectors of (A-M)
hence the projection is formed by (A-M)*V, which is approximated by
U*Sigma of reduced rank svd of (A-M). (Assuming row vector data
points). This is the scope of our PCA task here.

Now, consider a typical document corpus of say 1 million documents.
Typically (depending on stemming) total dictionary size would be
anywhere between 10^5 to 10^6, but let's assume 10^5 for our purposes.
However, an individual document will perhaps have 100 terms on average.
Thus the PCA input for such corpus will be a 1M x 100k sparse matrix
but the number of non-zero elements in it will be just 1M x 100 ==
100M. This is a fairly small problem for an SSVD to solve.

However, we want to run SVD on (A-M) where each row of M is the column
mean of A. The column mean is a dense vector, hence M is a dense matrix,
and so is (A-M).
Then in our hypothetical case we observe that (A-M) takes 1M x 100k
non-zero elements, which means it takes 1000 times more space than A,
and its ssvd will take, asymptotically speaking, (1000)^(3/2) more
flops than A, that is almost 5 orders of magnitude of additional svd
computational cost.

However, SSVD algorithm may be amended so that it doesn't have to ever
compute svd(A-M). I will not go into detail, but in essence the bottom
line is that it reduces computational expense to that being equivalent
to svd(A), i.e. almost 5 orders of magnitude in our hypothetical
corpus PCA case.

Hope it makes things clearer.
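The storage blow-up described above can be checked on a toy matrix. Below is a plain-Java sketch (illustrative only, not Mahout code; the class and method names are invented): subtracting the column means from a sparse matrix turns every entry non-zero.

```java
// Illustrative sketch: why svd(A - M) is so much more expensive than
// svd(A) for a sparse corpus matrix. We build a tiny sparse "corpus" A,
// form A - M where each row of M is the column-mean vector of A, and
// count non-zero entries before and after.
public class PcaDensification {

  // Count non-zero entries of a dense 2-d array.
  public static int nnz(double[][] m) {
    int count = 0;
    for (double[] row : m)
      for (double v : row)
        if (v != 0.0) count++;
    return count;
  }

  // Subtract the column means from every row, i.e. form A - M.
  public static double[][] subtractColumnMeans(double[][] a) {
    int rows = a.length, cols = a[0].length;
    double[] mean = new double[cols];
    for (double[] row : a)
      for (int j = 0; j < cols; j++) mean[j] += row[j] / rows;
    double[][] result = new double[rows][cols];
    for (int i = 0; i < rows; i++)
      for (int j = 0; j < cols; j++) result[i][j] = a[i][j] - mean[j];
    return result;
  }

  public static void main(String[] args) {
    // 4 "documents" x 6 "terms", mostly zeros, like a term-frequency matrix.
    double[][] a = {
        {1, 0, 0, 2, 0, 0},
        {0, 3, 0, 0, 0, 0},
        {0, 0, 1, 0, 0, 1},
        {2, 0, 0, 0, 1, 0},
    };
    System.out.println("nnz(A)     = " + nnz(a));                      // 7
    System.out.println("nnz(A - M) = " + nnz(subtractColumnMeans(a))); // 24, fully dense
  }
}
```

At the 1M x 100k scale in the post, the same effect turns 100M non-zeros into 10^11, which is exactly why the --pca option avoids ever materializing A-M.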

On Mon, Sep 10, 2012 at 7:25 AM, Pat Ferrel <pa...@gmail.com> wrote:
> Another issue with the SSVD job options is that if you want to use SSVD but keep the input in the original dimention-space (term-space in many cases) you would do the following
> * create input matrix A based on input dimensions (terms)
> * calculate the full transform, which retains the output in "term-space" AHat = U*Sigma*V^t
> * at the end AHat should be in term-space but transformed by DR and Sigma weighted for PCA, right?
> * then AHat can be substituted for A where analysis or examination needs the original dimension definitions (terms).
>
> The problem with the options is that when you set --uHalfSigma OR --vHalfSigma it sets the sVectors to sqrt and that will cause it to be applied to U and V since UJob and VJob only check to see if the sVectors exist and then they both apply them. In other words, either --uHalfSigma is set OR --vHalfSigma will apply sqrt Sigma to BOTH U and V.
>
> To do U*Sigma*V^t  the SSVD code would have to be changed or the U * Sigma would have to be calculated outside SSVD (an ugly alternative).
>
> But please correct me where I'm wrong.
>
>
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:
>
> I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.
>
> I need to figure out why the row ids/Keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S so why not clustering? In other words with rowsimilarity I can map the results back to the original input rows (documents in my case).
>
> As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.
>
> Another thing to consider and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of of the original term-space. Weights base on the new basis are not human understandable really. But setting me straight here may be another conversation.
>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. i am still not sure what are IDs in your
> definition and what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
>>
>> I'm not getting any id at all in the cluster data, not even a key for a row.
>>
>> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
>>
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> yeah seq2sparse seems to have -nv option now to churn out named
>> vectors too. It doesn't seem to be listed in the MIA book though.
>>
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> Sequence file keys, on the other hand, is
>>>> what populated by seq2sparse output, so they are useful for mapping
>>>> results to original documents.
>>>
>>> Although honestly i am not so sure about seq2sparse anymore. There has
>>> been some time since i looked at this for the last time.
>>
>
>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Sep 10, 2012 at 8:00 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Sep 10, 2012 7:26 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>> The problem with the options is that when you set --uHalfSigma OR
>> --vHalfSigma it sets the sVectors to sqrt and that will cause it to be
>> applied to U and V since UJob and VJob only check to see if the sVectors
>> exist and then they both apply them. In other words, either --uHalfSigma is
>> set OR --vHalfSigma will apply sqrt Sigma to BOTH U and V.
>
> I dont think this statement is correct. Ill check for it but i am pretty
> sure this is not how it works. Left and right singular vectors scaled to
> similar space individually on an explicit option to do so.
>

Yes i checked it and it is what i said.

>>
>> To do U*Sigma*V^t  the SSVD code would have to be changed or the U * Sigma
>> would have to be calculated outside SSVD (an ugly alternative).
>>
>> But please correct me where I'm wrong.
>>
>>
>> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>> I appreciate it and believe that this will help others too. I also agree
>> that we should think this one through to see if it is the correct approach.
>>
>> I need to figure out why the row ids/Keys of the input matrix are not
>> getting through clustering. Key/row ids are getting through rowsimilarity
>> when applied to U*S so why not clustering? In other words with rowsimilarity
>> I can map the results back to the original input rows (documents in my
>> case).
>>
>> As to the U*S option, I agree. I modified the code to take a new --uSigma
>> but it is mutually exclusive of --uHalfSigma and that indicates that neither
>> option should be a boolean. They both also imply calculation of U. I assume
>> this is what you meant below.
>>
>> Another thing to consider and here my ignorance shows through… The U*S
>> (equivalent to A*V) transform to V-space must be reversible so that humans
>> can see results in terms of of the original term-space. Weights base on the
>> new basis are not human understandable really. But setting me straight here
>> may be another conversation.
>>
>> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> I can do a patch to propagate names of named vectors from A to U too
>> if that's a requirement for what you do. But we need to make sure it
>> solves your problem. i am still not sure what are IDs in your
>> definition and what is required for k-means.
>>
>> Thinking of that, it's probably a worthy patch anyway. I'll write
>> something up along with API changes for A*Sigma outputs. I think since
>> there are so many output options, they should be redesigned not to be
>> mutually exclusive.
>>
>> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> > Yes, I would love to use namedvectors. But no matter doing a key to row
>> > lookup is easy enough.
>> >
>> > I'm not getting any id at all in the cluster data, not even a key for a
>> > row.
>> >
>> > I'm beginning to think this is a clustering problem since rowsimilarity
>> > at least gives me row keys to identify objects associated with an object.
>> >
>> > On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> >
>> > yeah seq2sparse seems to have -nv option now to churn out named
>> > vectors too. It doesn't seem to be listed in the MIA book though.
>> >
>> > On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > wrote:
>> >> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >> Sequence file keys, on the other hand, is
>> >>> what populated by seq2sparse output, so they are useful for mapping
>> >>> results to original documents.
>> >>
>> >> Although honestly i am not so sure about seq2sparse anymore. There has
>> >> been some time since i looked at this for the last time.
>> >
>>
>>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I don't understand this. What is A-hat and how is it different from the
input with the subtracted mean?

Ssvd pca doesn't form A-M because that thing is dense, and that's the
whole point of the pca option, i.e. to avoid the intermediate, ugly, big
dense product and the svd operation on it. The space required for A-M
will be several orders of magnitude bigger than the already sufficiently
large input, and running svd on it will take forever because ssvd flops
are f^3. Bottom line: ssvd pca cannot produce A-M simply because it
never computes it.

If you need A-M, i guess you could form it yourself, but in any sound
architecture it'd be much better to form rows of A-M on the fly instead
of computing and storing them.

Still, though, i haven't yet figured out what you are trying to accomplish.
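Forming rows of A-M on the fly, as suggested above, could look like this hypothetical plain-Java sketch (not Mahout's API; the class and method names are invented). Only one dense row exists in memory at a time:

```java
// Sketch: never materialize the dense matrix A - M; compute each
// mean-subtracted row on demand instead.
public class OnTheFlyRows {
  private final double[][] a;  // input rows (dense arrays here for simplicity)
  private final double[] mean; // precomputed column means (one dense vector)

  public OnTheFlyRows(double[][] a) {
    this.a = a;
    this.mean = new double[a[0].length];
    for (double[] row : a)
      for (int j = 0; j < mean.length; j++) mean[j] += row[j] / a.length;
  }

  // Row i of (A - M), computed on the fly.
  public double[] row(int i) {
    double[] r = new double[mean.length];
    for (int j = 0; j < mean.length; j++) r[j] = a[i][j] - mean[j];
    return r;
  }
}
```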

On Sep 10, 2012 7:26 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>
> Another issue with the SSVD job options is that if you want to use SSVD
but keep the input in the original dimention-space (term-space in many
cases) you would do the following
> * create input matrix A based on input dimensions (terms)
> * calculate the full transform, which retains the output in "term-space"
AHat = U*Sigma*V^t
> * at the end AHat should be in term-space but transformed by DR and Sigma
weighted for PCA, right?
> * then AHat can be substituted for A where analysis or examination needs
the original dimension definitions (terms).
>
> The problem with the options is that when you set --uHalfSigma OR
--vHalfSigma it sets the sVectors to sqrt and that will cause it to be
applied to U and V since UJob and VJob only check to see if the sVectors
exist and then they both apply them. In other words, either --uHalfSigma is
set OR --vHalfSigma will apply sqrt Sigma to BOTH U and V.

I dont think this statement is correct. Ill check for it but i am pretty
sure this is not how it works. Left and right singular vectors scaled to
similar space individually on an explicit option to do so.

>
> To do U*Sigma*V^t  the SSVD code would have to be changed or the U *
Sigma would have to be calculated outside SSVD (an ugly alternative).
>
> But please correct me where I'm wrong.
>
>
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:
>
> I appreciate it and believe that this will help others too. I also agree
that we should think this one through to see if it is the correct approach.
>
> I need to figure out why the row ids/Keys of the input matrix are not
getting through clustering. Key/row ids are getting through rowsimilarity
when applied to U*S so why not clustering? In other words with
rowsimilarity I can map the results back to the original input rows
(documents in my case).
>
> As to the U*S option, I agree. I modified the code to take a new --uSigma
but it is mutually exclusive of --uHalfSigma and that indicates that
neither option should be a boolean. They both also imply calculation of U.
I assume this is what you meant below.
>
> Another thing to consider and here my ignorance shows through… The U*S
(equivalent to A*V) transform to V-space must be reversible so that humans
can see results in terms of of the original term-space. Weights base on the
new basis are not human understandable really. But setting me straight here
may be another conversation.
>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. i am still not sure what are IDs in your
> definition and what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
> > Yes, I would love to use namedvectors. But no matter doing a key to row
lookup is easy enough.
> >
> > I'm not getting any id at all in the cluster data, not even a key for a
row.
> >
> > I'm beginning to think this is a clustering problem since rowsimilarity
at least gives me row keys to identify objects associated with an object.
> >
> > On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > yeah seq2sparse seems to have -nv option now to churn out named
> > vectors too. It doesn't seem to be listed in the MIA book though.
> >
> > On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:
> >> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:
> >> Sequence file keys, on the other hand, is
> >>> what populated by seq2sparse output, so they are useful for mapping
> >>> results to original documents.
> >>
> >> Although honestly i am not so sure about seq2sparse anymore. There has
> >> been some time since i looked at this for the last time.
> >
>
>

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
Another issue with the SSVD job options is that if you want to use SSVD but keep the input in the original dimension-space (term-space in many cases), you would do the following:
* create input matrix A based on input dimensions (terms)
* calculate the full transform, which retains the output in "term-space" AHat = U*Sigma*V^t
* at the end AHat should be in term-space but transformed by dimensionality reduction and Sigma-weighted for PCA, right?
* then AHat can be substituted for A where analysis or examination needs the original dimension definitions (terms).

The problem with the options is that setting --uHalfSigma OR --vHalfSigma sets the sVectors to their square roots, and that causes them to be applied to both U and V, since UJob and VJob only check whether the sVectors exist and then both apply them. In other words, setting either --uHalfSigma or --vHalfSigma will apply sqrt(Sigma) to BOTH U and V.

To do U*Sigma*V^t, the SSVD code would have to be changed, or U * Sigma would have to be calculated outside SSVD (an ugly alternative).

But please correct me where I'm wrong.
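For what it's worth, once U, the singular values, and V are small enough to hold in memory, the product AHat = U*Sigma*V^t is easy to sketch outside the solver. This is a plain-Java illustration only (not the SSVD API; all names are invented), exploiting the fact that Sigma is diagonal:

```java
// Plain-Java sketch (not the SSVD API) of AHat = U * Sigma * V^t.
// U is m x k, sv holds the k singular values (the diagonal of Sigma),
// V is n x k. Because Sigma is diagonal, no k x k matrix is needed:
// AHat[i][j] = sum_l U[i][l] * sv[l] * V[j][l].
public class ReducedRankProduct {
  public static double[][] uSigmaVt(double[][] u, double[] sv, double[][] v) {
    int m = u.length, n = v.length, k = sv.length;
    double[][] aHat = new double[m][n];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        for (int l = 0; l < k; l++)
          aHat[i][j] += u[i][l] * sv[l] * v[j][l];
    return aHat;
  }
}
```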


On Sep 8, 2012, at 10:52 AM, Pat Ferrel <pa...@gmail.com> wrote:

I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.

I need to figure out why the row ids/Keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S so why not clustering? In other words with rowsimilarity I can map the results back to the original input rows (documents in my case).

As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.

Another thing to consider and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of of the original term-space. Weights base on the new basis are not human understandable really. But setting me straight here may be another conversation.

On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I can do a patch to propagate names of named vectors from A to U too
if that's a requirement for what you do. But we need to make sure it
solves your problem. i am still not sure what are IDs in your
definition and what is required for k-means.

Thinking of that, it's probably a worthy patch anyway. I'll write
something up along with API changes for A*Sigma outputs. I think since
there are so many output options, they should be redesigned not to be
mutually exclusive.

On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
> 
> I'm not getting any id at all in the cluster data, not even a key for a row.
> 
> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
> 
> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> yeah seq2sparse seems to have -nv option now to churn out named
> vectors too. It doesn't seem to be listed in the MIA book though.
> 
> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Sequence file keys, on the other hand, is
>>> what populated by seq2sparse output, so they are useful for mapping
>>> results to original documents.
>> 
>> Although honestly i am not so sure about seq2sparse anymore. There has
>> been some time since i looked at this for the last time.
> 



Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>
> Another thing to consider and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of of the original term-space. Weights base on the new basis are not human understandable really. But setting me straight here may be another conversation.

Yes. fold-in and fold-out of new observations.

In my experience, fold-in is usually of much smaller volume at a time
and requires a good index on V. Since Mahout is essentially a batch
analytics framework, fold-in is not a good fit for Mahout IMO per se
(not that it is that difficult to build on your own either).

Fold-out ... i am not particularly sure what it would be useful for
(perhaps for cluster centroids in your case, yes) but it would be a
small job too, compared to the initial corpus data, not a big one.
Therefore, not much interest for having that in Mahout either. Result
post-processing, once it is not so bulky and not so science-laden, is
probably of little interest for this project.

I was giving it a thought for some time now and had this idea to have
a good Mahout-R integration package for result post-processing and
plotting, but was somewhat short on pragmatic interest and time to do
that. Since fold-in or fold-out doesn't generate a lot of flops, doing
that in R is actually more than feasible. And also a whole lot more
customizable.
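For reference, the fold-in of a new observation discussed above can be sketched in a few lines. This is the standard LSI-style fold-in, q = aNew * V * Sigma^-1, written as a hypothetical plain-Java illustration (invented names, not a Mahout API), assuming row data points:

```java
// Sketch of LSI-style fold-in: project a new document row into the
// reduced space without re-running the decomposition,
//   q = aNew * V * Sigma^-1
// aNew has length n (term-space), v is n x k, sv holds k singular values.
public class FoldIn {
  public static double[] foldIn(double[] aNew, double[][] v, double[] sv) {
    double[] q = new double[sv.length];
    for (int j = 0; j < sv.length; j++) {
      for (int t = 0; t < aNew.length; t++) q[j] += aNew[t] * v[t][j];
      q[j] /= sv[j]; // undo the Sigma scaling
    }
    return q;
  }
}
```

Since this generates so few flops, the same few lines translate directly to R, as suggested.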

>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. i am still not sure what are IDs in your
> definition and what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
>>
>> I'm not getting any id at all in the cluster data, not even a key for a row.
>>
>> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
>>
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> yeah seq2sparse seems to have -nv option now to churn out named
>> vectors too. It doesn't seem to be listed in the MIA book though.
>>
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> Sequence file keys, on the other hand, is
>>>> what populated by seq2sparse output, so they are useful for mapping
>>>> results to original documents.
>>>
>>> Although honestly i am not so sure about seq2sparse anymore. There has
>>> been some time since i looked at this for the last time.
>>
>

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.

I need to figure out why the row ids/Keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S so why not clustering? In other words with rowsimilarity I can map the results back to the original input rows (documents in my case).

As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.

Another thing to consider, and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of the original term-space. Weights based on the new basis are not really human-understandable. But setting me straight here may be another conversation.

On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

I can do a patch to propagate names of named vectors from A to U too
if that's a requirement for what you do. But we need to make sure it
solves your problem. i am still not sure what are IDs in your
definition and what is required for k-means.

Thinking of that, it's probably a worthy patch anyway. I'll write
something up along with API changes for A*Sigma outputs. I think since
there are so many output options, they should be redesigned not to be
mutually exclusive.

On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
> 
> I'm not getting any id at all in the cluster data, not even a key for a row.
> 
> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
> 
> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> yeah seq2sparse seems to have -nv option now to churn out named
> vectors too. It doesn't seem to be listed in the MIA book though.
> 
> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Sequence file keys, on the other hand, is
>>> what populated by seq2sparse output, so they are useful for mapping
>>> results to original documents.
>> 
>> Although honestly i am not so sure about seq2sparse anymore. There has
>> been some time since i looked at this for the last time.
> 


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I can do a patch to propagate names of named vectors from A to U too
if that's a requirement for what you do. But we need to make sure it
solves your problem. i am still not sure what are IDs in your
definition and what is required for k-means.

Thinking of that, it's probably a worthy patch anyway. I'll write
something up along with API changes for A*Sigma outputs. I think since
there are so many output options, they should be redesigned not to be
mutually exclusive.

On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <pa...@gmail.com> wrote:
> Yes, I would love to use namedvectors. But no matter doing a key to row lookup is easy enough.
>
> I'm not getting any id at all in the cluster data, not even a key for a row.
>
> I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
>
> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> yeah seq2sparse seems to have -nv option now to churn out named
> vectors too. It doesn't seem to be listed in the MIA book though.
>
> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Sequence file keys, on the other hand, is
>>> what populated by seq2sparse output, so they are useful for mapping
>>> results to original documents.
>>
>> Although honestly i am not so sure about seq2sparse anymore. There has
>> been some time since i looked at this for the last time.
>

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
Yes, I would love to use namedvectors. But no matter, doing a key-to-row lookup is easy enough. 

I'm not getting any id at all in the cluster data, not even a key for a row. 

I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.

On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

yeah seq2sparse seems to have -nv option now to churn out named
vectors too. It doesn't seem to be listed in the MIA book though.

On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Sequence file keys, on the other hand, is
>> what populated by seq2sparse output, so they are useful for mapping
>> results to original documents.
> 
> Although honestly i am not so sure about seq2sparse anymore. There has
> been some time since i looked at this for the last time.


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
yeah seq2sparse seems to have -nv option now to churn out named
vectors too. It doesn't seem to be listed in the MIA book though.

On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>  Sequence file keys, on the other hand, is
>> what populated by seq2sparse output, so they are useful for mapping
>> results to original documents.
>
> Although honestly i am not so sure about seq2sparse anymore. There has
> been some time since i looked at this for the last time.

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
 Sequence file keys, on the other hand, is
> what populated by seq2sparse output, so they are useful for mapping
> results to original documents.

Although honestly i am not so sure about seq2sparse anymore. It has
been some time since i last looked at this.

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Sep 7, 2012 at 2:12 PM, Pat Ferrel <pa...@gmail.com> wrote:
>
> Or clustering cannot map vectors to clusters since the KEYs are lost and only the IDs are kept, same for rowsimilarity. The clusteredPoints/part file contains a mapping of cluster ID to vector ID, neither of which is a KEY afaik.

I guess i still have a problem understanding your definition of ID. Do
you mean what is returned by NamedVector#getName()?

>
> However it looks like the jobs are preserving the VectorWritable so I'm not sure why the IDs are not there. Let me look deeper.

It is just because named vectors are not supported. There was no use
case for that before you. Sequence file keys, on the other hand, are
what is populated by seq2sparse output, so they are useful for mapping
results to original documents.

>
> On Sep 7, 2012, at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> just checked again. yes U inherits keys from sequence file of A. (are
> we talking about keys of the sequence files or names of a NamedVector?
> NamedVector is not supported, keys of sequence files are.)
>
> On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> More specifically, the way it works, Q matrix inherits keys of A rows
>> (BtJob line 137), and U inherits keys of Q (UJob line 128).
>>
>> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>>
>>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>>>
>>> That should not be the case. Ids in rows of U are inherited from rows
>>> of A. (should be at least).
>>>
>>>>
>>>> For
>>>>   BtJob.run(conf,
>>>>               inputPath,
>>>>               qPath,
>>>>               pcaMeanPath,
>>>>               btPath,
>>>>               minSplitSize,
>>>>               k,
>>>>               p,
>>>>               outerBlockHeight,
>>>>               q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>>               broadcast,
>>>>               labelType,
>>>>               q <= 0);
>>>>
>>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>>>
>>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>>>
>>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe its hidden…
>>>>
>>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>>>
>>> Yes, that's the idea.
>>>
>>> B^t matrix will not have the ids, not sure why you are looking
>>> there. You need the U matrix, which is solved by another job.
>>>
>>>>
>>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>>
>>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>>
>>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>>> proportions for doing clustering and similarity (though not sure if this is
>>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>>> instead of sqrt Sigma.
>>>>>
>>>>> Since I have no other reason to load U row by row I could write another
>>>>> transform and keep it out of the mahout core but doing this in a patch
>>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>>> option looks like the hardest part actually). For the API there needs to be
>>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>>> extra flag check in the setup for the UJob, something like the following
>>>>>
>>>>>    if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>>      sValues = SSVDHelper.loadVector(sigmaPath,
>>>>> context.getConfiguration());
>>>>>      // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>>> for Sigma weighting
>>>>>    }
>>>>>
>>>>> sValues is already applied to U in the map, which would remain unchanged
>>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>>> sValues.
>>>>>
>>>>>    if (sValues != null) {
>>>>>      for (int i = 0; i < k; i++) {
>>>>>        uRow.setQuick(i,
>>>>>                      qRow.dot(uHat.viewColumn(i)) *
>>>>> sValues.getQuick(i));
>>>>>      }
>>>>>    } else {
>>>>>      …
>>>>>
>>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>>
>>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>
>>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>>> have to turn the singular values vector (diagonal values) into a
>>>>> distributed row matrix or write your own multiply function, correct?
>>>>>
>>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>>> additionally encoded as a very short vector of singular values of
>>>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>>>> reason to encode it as a distributed row matrix.
>>>>>
>>>>> Multiplication can be done on the fly as you load U, row by row:
>>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>>
>>>>> One inconvenience with that approach is that it assumes you can freely
>>>>> hack the code that loads U matrix for further processing.
>>>>>
>>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>>> same logic as above (requires a patch), or just have it output
>>>>> U*Sigma^0.5 (does not require a patch).
>>>>>
>>>>> You could even use U in some cases directly, but part of the problem
>>>>> is that data variances will be normalized in all directions compared
>>>>> to PCA space, which will affect actual distances between data points.
>>>>> If you want to retain proportions of the directional variances as in
>>>>> your original input, you need to use principal components with scaling
>>>>> applied, i.e. U*Sigma.
>>>>>
>>>>>
>>>>>
>>>>
>
>
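
The on-the-fly multiply described above can be sketched in a few lines. This is a minimal stand-alone illustration only (plain double arrays stand in for Mahout's Vector types), not the actual SSVD solver code:

```java
// Sketch of scaling U by the singular values on the fly:
// (U * Sigma)[i][j] = U[i][j] * sv[j], so no Sigma matrix
// (let alone a distributed row matrix) ever needs to be built.
public class UScaling {

    // Scale one row of U by the singular-value vector sv, giving a row of U*Sigma.
    public static double[] scaleRow(double[] uRow, double[] sv) {
        double[] out = new double[uRow.length];
        for (int j = 0; j < uRow.length; j++) {
            out[j] = uRow[j] * sv[j];
        }
        return out;
    }

    // Same idea for the U * Sigma^0.5 variant (what --uHalfSigma produces).
    public static double[] scaleRowHalfSigma(double[] uRow, double[] sv) {
        double[] out = new double[uRow.length];
        for (int j = 0; j < uRow.length; j++) {
            out[j] = uRow[j] * Math.sqrt(sv[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] uRow = {1.0, 2.0};
        double[] sv = {4.0, 9.0};
        double[] us = scaleRow(uRow, sv);              // {4.0, 18.0}
        double[] usHalf = scaleRowHalfSigma(uRow, sv); // {2.0, 6.0}
        System.out.println(us[0] + " " + us[1] + " " + usHalf[0] + " " + usHalf[1]);
    }
}
```

Either variant runs in O(k) per row while streaming U, which is why the thread keeps stressing that a distributed Sigma is unnecessary.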

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
OK, maybe I have found it but not sure what to do.

    protected void map(Writable key, VectorWritable value, Context context)
      throws IOException, InterruptedException {

      mapContext = context;
      // output Bt outer products
      Vector aRow = value.get();

      Vector qRow = qr.next();
      int kp = qRow.size();
      qRowValue.set(qRow);

      // make sure Qs are inheriting A row labels.
      outputQRow(key, qRowValue);

In the debugger the VectorWritable value has the textual id in it, but qRow is actually a DenseVector with no id attached.

Is this because of the "3rd party eigensolver"? Wild guess of the moment.

In any case is there a reason not to use a version of vector that preserves the ID?
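
What seems to be going on in the mapper above can be sketched abstractly (hypothetical stand-ins for the Hadoop/Mahout types, not the actual BtJob code): qRow comes fresh out of the QR block rather than being derived from aRow, so it naturally carries no name; the only identity that survives is the sequence-file key that outputQRow re-emits alongside it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QRowEmit {

    // Emulates the map() above: for each (key, aRow) pair, a fresh,
    // nameless qRow is produced and emitted under the inherited key,
    // mirroring the "make sure Qs are inheriting A row labels" comment.
    public static Map<String, double[]> emitQRows(Map<String, double[]> aRows) {
        Map<String, double[]> emitted = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : aRows.entrySet()) {
            // A plain dense row with no id of its own; the real QR
            // computation would fill it in here.
            double[] qRow = new double[e.getValue().length];
            emitted.put(e.getKey(), qRow); // identity preserved only via the key
        }
        return emitted;
    }
}
```

Under this reading there is no need for a vector type that preserves the ID: the ID rides along as the emitted key, not inside the vector payload.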


On Sep 7, 2012, at 2:02 PM, Pat Ferrel <pa...@gmail.com> wrote:

OK, so this implies I need to create an index of A mapping names/ids to keys, then use that map on U.

Actually I think U will have to be converted to contain IDs or kmeans and similarity won't produce meaningful results.

U must be of the form:
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: 20:{2127:1.0}
Key: 1: Value: 53:{1:0.04140155813392109,23:0.04761729906397759,33:0.04140155813392109,35:0.03874202735546817,50:0.03318442428909763,69:0.04140155813392109,90:0.03993791262049265,100:0.04140155813392109,105:0.04140155813392109,119:0.03993791262049265,124:0.04140155813392109,133:0.036082496577015254,138:0.04140155813392109,143:0.03993791262049265,165:0.04140155813392109,167:0.05648073565065591,172:0.039937

Otherwise clustering cannot map vectors to clusters, since the KEYs are lost and only the IDs are kept; the same goes for rowsimilarity. The clusteredPoints/part file contains a mapping of cluster ID to vector ID, neither of which is a KEY afaik.

However it looks like the jobs are preserving the VectorWritable so I'm not sure why the IDs are not there. Let me look deeper.

On Sep 7, 2012, at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

just checked again. yes U inherits keys from sequence file of A. (are
we talking about keys of the sequence files or names of a NamedVector?
NamedVector is not supported, keys of sequence files are.)

On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
> 
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>> 
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>> 
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>> 
>>> 
>>> For
>>>   BtJob.run(conf,
>>>               inputPath,
>>>               qPath,
>>>               pcaMeanPath,
>>>               btPath,
>>>               minSplitSize,
>>>               k,
>>>               p,
>>>               outerBlockHeight,
>>>               q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>               broadcast,
>>>               labelType,
>>>               q <= 0);
>>> 
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>> 
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>> 
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>> 
>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>> 
>> Yes, that's the idea.
>> 
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is produced by another job.
>> 
>>> 
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> 
>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>> 
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>> 
>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>> 
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>> 
>>>>    if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>      sValues = SSVDHelper.loadVector(sigmaPath,
>>>> context.getConfiguration());
>>>>      // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>> for Sigma weighting
>>>>    }
>>>> 
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>> sValues.
>>>> 
>>>>    if (sValues != null) {
>>>>      for (int i = 0; i < k; i++) {
>>>>        uRow.setQuick(i,
>>>>                      qRow.dot(uHat.viewColumn(i)) *
>>>> sValues.getQuick(i));
>>>>      }
>>>>    } else {
>>>>      …
>>>> 
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>> 
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>> 
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>> 
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>>> reason to encode it as a distributed row matrix.
>>>> 
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>> 
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>> 
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch), or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>> 
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>> 
>>>> 
>>>> 
>>> 



Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
OK, so this implies I need to create an index of A mapping names/ids to keys, then use that map on U.

Actually I think U will have to be converted to contain IDs or kmeans and similarity won't produce meaningful results.

U must be of the form:
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: 20:{2127:1.0}
Key: 1: Value: 53:{1:0.04140155813392109,23:0.04761729906397759,33:0.04140155813392109,35:0.03874202735546817,50:0.03318442428909763,69:0.04140155813392109,90:0.03993791262049265,100:0.04140155813392109,105:0.04140155813392109,119:0.03993791262049265,124:0.04140155813392109,133:0.036082496577015254,138:0.04140155813392109,143:0.03993791262049265,165:0.04140155813392109,167:0.05648073565065591,172:0.039937

Otherwise clustering cannot map vectors to clusters, since the KEYs are lost and only the IDs are kept; the same goes for rowsimilarity. The clusteredPoints/part file contains a mapping of cluster ID to vector ID, neither of which is a KEY afaik.

However it looks like the jobs are preserving the VectorWritable so I'm not sure why the IDs are not there. Let me look deeper.
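
The index approach can be sketched like so. This is a hypothetical illustration using plain Java maps in place of the sequence files (the real thing would scan A and U with the Hadoop SequenceFile reader); since U inherits A's keys row for row, joining on the key recovers each reduced row's document id:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocIdIndex {

    // keyToDocId: built by scanning A once, mapping each sequence-file key
    //             to the textual doc id carried in that row's vector.
    // uRows:      U's rows, keyed by the same sequence-file keys as A.
    // Returns U re-keyed by doc id, suitable for interpreting kmeans or
    // rowsimilarity output in terms of documents.
    public static Map<String, double[]> nameRows(Map<Integer, String> keyToDocId,
                                                 Map<Integer, double[]> uRows) {
        Map<String, double[]> named = new LinkedHashMap<>();
        for (Map.Entry<Integer, double[]> e : uRows.entrySet()) {
            named.put(keyToDocId.get(e.getKey()), e.getValue());
        }
        return named;
    }
}
```

The join is safe only because, as Dmitriy notes below in the thread, Q inherits A's keys and U inherits Q's, so row order and keys line up between A and U.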

On Sep 7, 2012, at 1:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

just checked again. yes U inherits keys from sequence file of A. (are
we talking about keys of the sequence files or names of a NamedVector?
NamedVector is not supported, keys of sequence files are.)

On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
> 
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>> 
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>> 
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>> 
>>> 
>>> For
>>>    BtJob.run(conf,
>>>                inputPath,
>>>                qPath,
>>>                pcaMeanPath,
>>>                btPath,
>>>                minSplitSize,
>>>                k,
>>>                p,
>>>                outerBlockHeight,
>>>                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                broadcast,
>>>                labelType,
>>>                q <= 0);
>>> 
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>> 
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>> 
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>> 
>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>> 
>> Yes, that's the idea.
>> 
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is produced by another job.
>> 
>>> 
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> 
>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>> 
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>> 
>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>> 
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>> 
>>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>       sValues = SSVDHelper.loadVector(sigmaPath,
>>>> context.getConfiguration());
>>>>       // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>> for Sigma weighting
>>>>     }
>>>> 
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>> sValues.
>>>> 
>>>>     if (sValues != null) {
>>>>       for (int i = 0; i < k; i++) {
>>>>         uRow.setQuick(i,
>>>>                       qRow.dot(uHat.viewColumn(i)) *
>>>> sValues.getQuick(i));
>>>>       }
>>>>     } else {
>>>>       …
>>>> 
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>> 
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>> 
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>> 
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>>> reason to encode it as a distributed row matrix.
>>>> 
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>> 
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>> 
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch), or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>> 
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>> 
>>>> 
>>>> 
>>> 


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
just checked again. yes U inherits keys from sequence file of A. (are
we talking about keys of the sequence files or names of a NamedVector?
NamedVector is not supported, keys of sequence files are.)

On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
>
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>>
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>>
>>>
>>> For
>>>     BtJob.run(conf,
>>>                 inputPath,
>>>                 qPath,
>>>                 pcaMeanPath,
>>>                 btPath,
>>>                 minSplitSize,
>>>                 k,
>>>                 p,
>>>                 outerBlockHeight,
>>>                 q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                 broadcast,
>>>                 labelType,
>>>                 q <= 0);
>>>
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>>
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>>
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>>
>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>>
>> Yes, that's the idea.
>>
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is produced by another job.
>>
>>>
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>
>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>>
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>>
>>>>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>        sValues = SSVDHelper.loadVector(sigmaPath,
>>>> context.getConfiguration());
>>>>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>> for Sigma weighting
>>>>      }
>>>>
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>> sValues.
>>>>
>>>>      if (sValues != null) {
>>>>        for (int i = 0; i < k; i++) {
>>>>          uRow.setQuick(i,
>>>>                        qRow.dot(uHat.viewColumn(i)) *
>>>> sValues.getQuick(i));
>>>>        }
>>>>      } else {
>>>>        …
>>>>
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>>
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>>> reason to encode it as a distributed row matrix.
>>>>
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>>
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch), or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>>
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>>
>>>>
>>>>
>>>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yeah. probably we are talking about different things.



On Fri, Sep 7, 2012 at 1:49 PM, Pat Ferrel <pa...@gmail.com> wrote:
> I'm looking in the output. Since I'm not familiar with what the jobs do, I am looking everywhere. One thing seems true: ids go in but they do not come out, in any output.
>
> I do find your comments instructive though. My first time messing with mahout job internals.
>
> Maybe we have a terminology problem. The Keys are not the same as ids, right? The Key is an int; the ID is Text, for one thing. The ID is part of the row/vector and can be text or an int. I call them doc ids because that's what I use them for.
>
> On Sep 7, 2012, at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
>
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>>
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>>
>>>
>>> For
>>>    BtJob.run(conf,
>>>                inputPath,
>>>                qPath,
>>>                pcaMeanPath,
>>>                btPath,
>>>                minSplitSize,
>>>                k,
>>>                p,
>>>                outerBlockHeight,
>>>                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                broadcast,
>>>                labelType,
>>>                q <= 0);
>>>
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>>
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>>
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>>
>>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
>>
>> Yes, that's the idea.
>>
>> The B^t matrix will not have the ids; not sure why you are looking
>> there. You need the U matrix, which is produced by another job.
>>
>>>
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Yes you got it, thats what i was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>
>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>>
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>>
>>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>       sValues = SSVDHelper.loadVector(sigmaPath,
>>>> context.getConfiguration());
>>>>       // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>> for Sigma weighting
>>>>     }
>>>>
>>>> sValues is already applied to U in the map, which would remain unchanged
>>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>>> sValues.
>>>>
>>>>     if (sValues != null) {
>>>>       for (int i = 0; i < k; i++) {
>>>>         uRow.setQuick(i,
>>>>                       qRow.dot(uHat.viewColumn(i)) *
>>>> sValues.getQuick(i));
>>>>       }
>>>>     } else {
>>>>       …
>>>>
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>>
>>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>>> additionally encoded as a very short vector of singular values of
>>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>>> reason to encode it as a distributed row matrix.
>>>>
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>>
>>>> It is much easier to have SSVD output U*Sigma directly using the
>>>> same logic as above (requires a patch), or just have it output
>>>> U*Sigma^0.5 (does not require a patch).
>>>>
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>>
>>>>
>>>>
>>>
>

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
I'm looking in the output. Since I'm not familiar with what the jobs do, I am looking everywhere. One thing seems true: ids go in but they do not come out, in any output.

I do find your comments instructive though. My first time messing with mahout job internals.

Maybe we have a terminology problem. The Keys are not the same as ids, right? The Key is an int; the ID is Text, for one thing. The ID is part of the row/vector and can be text or an int. I call them doc ids because that's what I use them for.

On Sep 7, 2012, at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

More specifically, the way it works, Q matrix inherits keys of A rows
(BtJob line 137), and U inherits keys of Q (UJob line 128).

On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>> 
>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
> 
> That should not be the case. Ids in rows of U are inherited from rows
> of A. (should be at least).
> 
>> 
>> For
>>    BtJob.run(conf,
>>                inputPath,
>>                qPath,
>>                pcaMeanPath,
>>                btPath,
>>                minSplitSize,
>>                k,
>>                p,
>>                outerBlockHeight,
>>                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>                broadcast,
>>                labelType,
>>                q <= 0);
>> 
>> inputPath here contains a distributedRowMatrix with text doc ids.
>> 
>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>> 
>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>> 
>> Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.
> 
> Yes, that's the idea.
> 
> The B^t matrix will not have the ids; not sure why you are looking
> there. You need the U matrix, which is produced by another job.
> 
>> 
>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> Yes you got it, thats what i was proposing before. A very easy patch.
>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>> 
>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>> 
>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>> proportions for doing clustering and similarity (though not sure if this is
>>> strictly required with cosine distance) I probably want to use U * Sigma
>>> instead of sqrt Sigma.
>>> 
>>> Since I have no other reason to load U row by row I could write another
>>> transform and keep it out of the mahout core but doing this in a patch
>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>> option looks like the hardest part actually). For the API there needs to be
>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>> extra flag check in the setup for the UJob, something like the following
>>> 
>>>     if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>       sValues = SSVDHelper.loadVector(sigmaPath,
>>> context.getConfiguration());
>>>       // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>> for Sigma weighting
>>>     }
>>> 
>>> sValues is already applied to U in the map, which would remain unchanged
>>> since the sigma weighted (instead of sqrt sigma) values will already be in
>>> sValues.
>>> 
>>>     if (sValues != null) {
>>>       for (int i = 0; i < k; i++) {
>>>         uRow.setQuick(i,
>>>                       qRow.dot(uHat.viewColumn(i)) *
>>> sValues.getQuick(i));
>>>       }
>>>     } else {
>>>       …
>>> 
>>> I'll give this a try and if it seems reasonable submit a patch.
>>> 
>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> 
>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>> have to turn the singular values vector (diagonal values) into a
>>> distributed row matrix or write your own multiply function, correct?
>>> 
>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>> additionally encoded as a very short vector of singular values of
>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>> reason to encode it as a distributed row matrix.
>>> 
>>> Multiplication can be done on the fly as you load U, row by row:
>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>> 
>>> One inconvenience with that approach is that it assumes you can freely
>>> hack the code that loads U matrix for further processing.
>>> 
>>> It is much easier to have SSVD output U*Sigma directly using the
>>> same logic as above (requires a patch), or just have it output
>>> U*Sigma^0.5 (does not require a patch).
>>> 
>>> You could even use U in some cases directly, but part of the problem
>>> is that data variances will be normalized in all directions compared
>>> to PCA space, which will affect actual distances between data points.
>>> If you want to retain proportions of the directional variances as in
>>> your original input, you need to use principal components with scaling
>>> applied, i.e. U*Sigma.
>>> 
>>> 
>>> 
>> 
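Since Sigma is diagonal, the on-the-fly multiplication quoted above, U*Sigma[i,j] = U[i,j]*sv[j], is just a per-column scaling of U by the singular values. A minimal plain-Java sketch (plain arrays stand in for Mahout's Vector/Matrix types; the class and method names here are illustrative, not Mahout API):

```java
public class USigmaScale {
    /** Computes U*Sigma on the fly: because Sigma is diagonal, each entry of U
        is simply scaled by the singular value of its column,
        (U*Sigma)[i][j] = U[i][j] * sv[j]. */
    static double[][] scale(double[][] u, double[] sv) {
        double[][] uSigma = new double[u.length][];
        for (int i = 0; i < u.length; i++) {       // row by row, as when streaming U
            uSigma[i] = new double[u[i].length];
            for (int j = 0; j < u[i].length; j++) {
                uSigma[i][j] = u[i][j] * sv[j];
            }
        }
        return uSigma;
    }

    public static void main(String[] args) {
        double[][] u = {{1.0, 2.0}, {3.0, 4.0}};
        double[] sv = {10.0, 0.5};                 // singular values, length k
        double[][] us = scale(u, sv);
        System.out.println(us[0][0] + " " + us[0][1]); // 10.0 1.0
        System.out.println(us[1][0] + " " + us[1][1]); // 30.0 2.0
    }
}
```

For the U*Sigma^0.5 weighting the same loop would multiply by Math.sqrt(sv[j]) instead; no distributed row matrix for Sigma is needed in either case.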


Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Just checked again. Yes, U inherits keys from the sequence file of A. (Are
we talking about keys of the sequence files or names of a NamedVector?
NamedVector is not supported; keys of sequence files are.)

On Fri, Sep 7, 2012 at 1:34 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> More specifically, the way it works, Q matrix inherits keys of A rows
> (BtJob line 137), and U inherits keys of Q (UJob line 128).
>
> On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>>
>>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>>
>> That should not be the case. Ids in rows of U are inherited from rows
>> of A. (should be at least).
>>
>>>
>>> For
>>>     BtJob.run(conf,
>>>                 inputPath,
>>>                 qPath,
>>>                 pcaMeanPath,
>>>                 btPath,
>>>                 minSplitSize,
>>>                 k,
>>>                 p,
>>>                 outerBlockHeight,
>>>                 q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>>                 broadcast,
>>>                 labelType,
>>>                 q <= 0);
>>>
>>> inputPath here contains a distributedRowMatrix with text doc ids.
>>>
>>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>>
>>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>>
>>> Are the keys of U guaranteed to be the same as A? If so I could construct an index for A and use it on U, but it would be nice to get them out of the solver.
>>
>> Yes, that's the idea.
>>
>> B^t matrix will not have the ids, not sure why you are looking
>> there. You need the U matrix, which is computed by another job.
>>
>>>
>>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Yes, you got it, that's what I was proposing before. A very easy patch.
>>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>
>>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>>
>>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>>> proportions for doing clustering and similarity (though not sure if this is
>>>> strictly required with cosine distance) I probably want to use U * Sigma
>>>> instead of sqrt Sigma.
>>>>
>>>> Since I have no other reason to load U row by row I could write another
>>>> transform and keep it out of the mahout core but doing this in a patch
>>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>>> option looks like the hardest part actually). For the API there needs to be
>>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>>> extra flag check in the setup for the UJob, something like the following
>>>>
>>>>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>>        sValues = SSVDHelper.loadVector(sigmaPath,
>>>> context.getConfiguration());
>>>>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>>> for Sigma weighting
>>>>      }
>>>>
>>>> sValues is already applied to U in the map, so that code remains
>>>> unchanged; the sigma-weighted (instead of sqrt-sigma) values will
>>>> already be in sValues.
>>>>
>>>>      if (sValues != null) {
>>>>        for (int i = 0; i < k; i++) {
>>>>          uRow.setQuick(i,
>>>>                        qRow.dot(uHat.viewColumn(i)) *
>>>> sValues.getQuick(i));
>>>>        }
>>>>      } else {
>>>>        …
>>>>
>>>> I'll give this a try and if it seems reasonable submit a patch.
>>>>
>>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>>> have to turn the singular values vector (diagonal values) into a
>>>> distributed row matrix or write your own multiply function, correct?
>>>>
>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>> additionally encoded as a very short vector of singular values of
>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>> reason to encode it as a distributed row matrix.
>>>>
>>>> Multiplication can be done on the fly as you load U, row by row:
>>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>>
>>>> One inconvenience with that approach is that it assumes you can freely
>>>> hack the code that loads U matrix for further processing.
>>>>
>>> It is much easier to have SSVD output U*Sigma directly using the
>>> same logic as above (requires a patch), or just have it output
>>> U*Sigma^0.5 (does not require a patch).
>>>>
>>>> You could even use U in some cases directly, but part of the problem
>>>> is that data variances will be normalized in all directions compared
>>>> to PCA space, which will affect actual distances between data points.
>>>> If you want to retain proportions of the directional variances as in
>>>> your original input, you need to use principal components with scaling
>>>> applied, i.e. U*Sigma.
>>>>
>>>>
>>>>
>>>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
More specifically, the way it works, Q matrix inherits keys of A rows
(BtJob line 137), and U inherits keys of Q (UJob line 128).

On Fri, Sep 7, 2012 at 1:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>>
>> However I still have no doc ids in U. Has anyone seen a case where they are preserved?
>
> That should not be the case. Ids in rows of U are inherited from rows
> of A. (should be at least).
>
>>
>> For
>>     BtJob.run(conf,
>>                 inputPath,
>>                 qPath,
>>                 pcaMeanPath,
>>                 btPath,
>>                 minSplitSize,
>>                 k,
>>                 p,
>>                 outerBlockHeight,
>>                 q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>>                 broadcast,
>>                 labelType,
>>                 q <= 0);
>>
>> inputPath here contains a distributedRowMatrix with text doc ids.
>>
>> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>>
>> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>>
>> Are the keys of U guaranteed to be the same as A? If so I could construct an index for A and use it on U, but it would be nice to get them out of the solver.
>
> Yes, that's the idea.
>
> B^t matrix will not have the ids, not sure why you are looking
> there. You need the U matrix, which is computed by another job.
>
>>
>> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> Yes, you got it, that's what I was proposing before. A very easy patch.
>> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>
>>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>>
>>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>>> proportions for doing clustering and similarity (though not sure if this is
>>> strictly required with cosine distance) I probably want to use U * Sigma
>>> instead of sqrt Sigma.
>>>
>>> Since I have no other reason to load U row by row I could write another
>>> transform and keep it out of the mahout core but doing this in a patch
>>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>>> option looks like the hardest part actually). For the API there needs to be
>>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>>> extra flag check in the setup for the UJob, something like the following
>>>
>>>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>>        sValues = SSVDHelper.loadVector(sigmaPath,
>>> context.getConfiguration());
>>>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>>> for Sigma weighting
>>>      }
>>>
>>> sValues is already applied to U in the map, so that code remains
>>> unchanged; the sigma-weighted (instead of sqrt-sigma) values will
>>> already be in sValues.
>>>
>>>      if (sValues != null) {
>>>        for (int i = 0; i < k; i++) {
>>>          uRow.setQuick(i,
>>>                        qRow.dot(uHat.viewColumn(i)) *
>>> sValues.getQuick(i));
>>>        }
>>>      } else {
>>>        …
>>>
>>> I'll give this a try and if it seems reasonable submit a patch.
>>>
>>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>>> the SSVD solver. One difficulty is that to perform the multiplication you
>>> have to turn the singular values vector (diagonal values) into a
>>> distributed row matrix or write your own multiply function, correct?
>>>
>>> You could do that, but why? Sigma is a diagonal matrix (which is
>>> additionally encoded as a very short vector of singular values of
>>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>>> reason to encode it as a distributed row matrix.
>>>
>>> Multiplication can be done on the fly as you load U, row by row:
>>> U*Sigma[i,j]=U[i,j]*sv[j]
>>>
>>> One inconvenience with that approach is that it assumes you can freely
>>> hack the code that loads U matrix for further processing.
>>>
>>> It is much easier to have SSVD output U*Sigma directly using the
>>> same logic as above (requires a patch), or just have it output
>>> U*Sigma^0.5 (does not require a patch).
>>>
>>> You could even use U in some cases directly, but part of the problem
>>> is that data variances will be normalized in all directions compared
>>> to PCA space, which will affect actual distances between data points.
>>> If you want to retain proportions of the directional variances as in
>>> your original input, you need to use principal components with scaling
>>> applied, i.e. U*Sigma.
>>>
>>>
>>>
>>

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
>
> B^t matrix will not have the ids, not sure why you are looking
> there. You need the U matrix, which is computed by another job.

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Sep 7, 2012 at 1:11 PM, Pat Ferrel <pa...@gmail.com> wrote:
> OK, U * Sigma seems to be working in the patch of SSVDSolver.
>
> However I still have no doc ids in U. Has anyone seen a case where they are preserved?

That should not be the case. Ids in rows of U are inherited from rows
of A. (should be at least).

>
> For
>     BtJob.run(conf,
>                 inputPath,
>                 qPath,
>                 pcaMeanPath,
>                 btPath,
>                 minSplitSize,
>                 k,
>                 p,
>                 outerBlockHeight,
>                 q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
>                 broadcast,
>                 labelType,
>                 q <= 0);
>
> inputPath here contains a distributedRowMatrix with text doc ids.
>
> Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.
>
> Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…
>
> Are the keys of U guaranteed to be the same as A? If so I could construct an index for A and use it on U, but it would be nice to get them out of the solver.

Yes, that's the idea.

B^t matrix will not have the ids, not sure why you are looking
there. You need the U matrix, which is computed by another job.

>
> On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> Yes, you got it, that's what I was proposing before. A very easy patch.
> On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>
>> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>>
>> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
>> proportions for doing clustering and similarity (though not sure if this is
>> strictly required with cosine distance) I probably want to use U * Sigma
>> instead of sqrt Sigma.
>>
>> Since I have no other reason to load U row by row I could write another
>> transform and keep it out of the mahout core but doing this in a patch
>> seems trivial. Just create a new flag, something like --uSigma (the CLI
>> option looks like the hardest part actually). For the API there needs to be
>> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
>> extra flag check in the setup for the UJob, something like the following
>>
>>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
>> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>>        sValues = SSVDHelper.loadVector(sigmaPath,
>> context.getConfiguration());
>>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
>> for Sigma weighting
>>      }
>>
>> sValues is already applied to U in the map, so that code remains
>> unchanged; the sigma-weighted (instead of sqrt-sigma) values will
>> already be in sValues.
>>
>>      if (sValues != null) {
>>        for (int i = 0; i < k; i++) {
>>          uRow.setQuick(i,
>>                        qRow.dot(uHat.viewColumn(i)) *
>> sValues.getQuick(i));
>>        }
>>      } else {
>>        …
>>
>> I'll give this a try and if it seems reasonable submit a patch.
>>
>> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> When using PCA it's also preferable to use --uHalfSigma to create U with
>> the SSVD solver. One difficulty is that to perform the multiplication you
>> have to turn the singular values vector (diagonal values) into a
>> distributed row matrix or write your own multiply function, correct?
>>
>> You could do that, but why? Sigma is a diagonal matrix (which is
>> additionally encoded as a very short vector of singular values of
>> length k; say we denote it as 'sv'). Given that, there's absolutely no
>> reason to encode it as a distributed row matrix.
>>
>> Multiplication can be done on the fly as you load U, row by row:
>> U*Sigma[i,j]=U[i,j]*sv[j]
>>
>> One inconvenience with that approach is that it assumes you can freely
>> hack the code that loads U matrix for further processing.
>>
>> It is much easier to have SSVD output U*Sigma directly using the
>> same logic as above (requires a patch), or just have it output
>> U*Sigma^0.5 (does not require a patch).
>>
>> You could even use U in some cases directly, but part of the problem
>> is that data variances will be normalized in all directions compared
>> to PCA space, which will affect actual distances between data points.
>> If you want to retain proportions of the directional variances as in
>> your original input, you need to use principal components with scaling
>> applied, i.e. U*Sigma.
>>
>>
>>
>

Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
OK, U * Sigma seems to be working in the patch of SSVDSolver.

However I still have no doc ids in U. Has anyone seen a case where they are preserved?

For
    BtJob.run(conf,
                inputPath,
                qPath,
                pcaMeanPath,
                btPath,
                minSplitSize,
                k,
                p,
                outerBlockHeight,
                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
                broadcast,
                labelType,
                q <= 0);

inputPath here contains a distributedRowMatrix with text doc ids.

Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.

Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe it's hidden…

Are the keys of U guaranteed to be the same as A? If so I could construct an index for A and use it on U, but it would be nice to get them out of the solver.

On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Yes, you got it, that's what I was proposing before. A very easy patch.
On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:

> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
> 
> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
> proportions for doing clustering and similarity (though not sure if this is
> strictly required with cosine distance) I probably want to use U * Sigma
> instead of sqrt Sigma.
> 
> Since I have no other reason to load U row by row I could write another
> transform and keep it out of the mahout core but doing this in a patch
> seems trivial. Just create a new flag, something like --uSigma (the CLI
> option looks like the hardest part actually). For the API there needs to be
> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
> extra flag check in the setup for the UJob, something like the following
> 
>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>        sValues = SSVDHelper.loadVector(sigmaPath,
> context.getConfiguration());
>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
> for Sigma weighting
>      }
> 
> sValues is already applied to U in the map, so that code remains
> unchanged; the sigma-weighted (instead of sqrt-sigma) values will
> already be in sValues.
> 
>      if (sValues != null) {
>        for (int i = 0; i < k; i++) {
>          uRow.setQuick(i,
>                        qRow.dot(uHat.viewColumn(i)) *
> sValues.getQuick(i));
>        }
>      } else {
>        …
> 
> I'll give this a try and if it seems reasonable submit a patch.
> 
> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> When using PCA it's also preferable to use --uHalfSigma to create U with
> the SSVD solver. One difficulty is that to perform the multiplication you
> have to turn the singular values vector (diagonal values) into a
> distributed row matrix or write your own multiply function, correct?
> 
> You could do that, but why? Sigma is a diagonal matrix (which is
> additionally encoded as a very short vector of singular values of
> length k; say we denote it as 'sv'). Given that, there's absolutely no
> reason to encode it as a distributed row matrix.
> 
> Multiplication can be done on the fly as you load U, row by row:
> U*Sigma[i,j]=U[i,j]*sv[j]
> 
> One inconvenience with that approach is that it assumes you can freely
> hack the code that loads U matrix for further processing.
> 
> It is much easier to have SSVD output U*Sigma directly using the
> same logic as above (requires a patch), or just have it output
> U*Sigma^0.5 (does not require a patch).
> 
> You could even use U in some cases directly, but part of the problem
> is that data variances will be normalized in all directions compared
> to PCA space, which will affect actual distances between data points.
> If you want to retain proportions of the directional variances as in
> your original input, you need to use principal components with scaling
> applied, i.e. U*Sigma.
> 
> 
> 
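The patch discussed in this thread reduces to one line in the UJob mapper: weight each projected coordinate by the full singular value instead of its square root. A standalone plain-array sketch of that inner computation (uRow, qRow, uHat, and sv mirror the variables in the quoted UJob code, but this is an illustration under those naming assumptions, not the actual Mahout class):

```java
public class URowWeighting {
    /** Computes one output row of U with optional sigma weighting:
        uRow[i] = (qRow dot column i of uHat) * w(sv[i]),
        where w is sqrt for the --uHalfSigma behavior and identity for the
        proposed full U*Sigma weighting. */
    static double[] uRow(double[] qRow, double[][] uHat, double[] sv, boolean halfSigma) {
        int k = sv.length;
        double[] uRow = new double[k];
        for (int i = 0; i < k; i++) {
            double dot = 0.0;
            for (int j = 0; j < qRow.length; j++) {
                dot += qRow[j] * uHat[j][i];      // qRow.dot(uHat.viewColumn(i))
            }
            double weight = halfSigma ? Math.sqrt(sv[i]) : sv[i];
            uRow[i] = dot * weight;
        }
        return uRow;
    }

    public static void main(String[] args) {
        double[] qRow = {1.0, 1.0};
        double[][] uHat = {{1.0, 0.0}, {0.0, 1.0}}; // identity, so the dot picks out qRow[i]
        double[] sv = {4.0, 9.0};
        double[] sigma = uRow(qRow, uHat, sv, false); // U*Sigma weighting
        double[] half = uRow(qRow, uHat, sv, true);   // U*Sigma^0.5 weighting
        System.out.println(sigma[0] + " " + sigma[1]); // 4.0 9.0
        System.out.println(half[0] + " " + half[1]);   // 2.0 3.0
    }
}
```

With halfSigma=true this matches the existing --uHalfSigma output; with false it gives the U*Sigma output the proposed --uSigma flag would produce.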


Re: SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
OK, U * Sigma seems to be working in the patch of SSVDSolver.

However I still have no doc ids in U. Has anyone seen a case where they are preserved?

For
    BtJob.run(conf,
                inputPath,
                qPath,
                pcaMeanPath,
                btPath,
                minSplitSize,
                k,
                p,
                outerBlockHeight,
                q <= 0 ? Math.min(1000, reduceTasks) : reduceTasks,
                broadcast,
                labelType,
                q <= 0);

inputPath here contains a distributedRowMatrix with text doc ids.

Bt-job/part-r-00000 has no ids after the BtJob. Not sure where else to look for them and BtJob is the only place the input matrix is used, the rest are intermediates afaict and anyway don't have ids either.

Is something in BtJob stripping them? It looks like ids are ignored in the MR code but maybe its hidden…

Are the Keys of U guaranteed  to be the same as A? If so I could construct an index for A and use it on U but it would be nice to get them out of the solver.

On Sep 7, 2012, at 9:18 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

Yes you got it, thats what i was proposing before. A very easy patch.
On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:

> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
> 
> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
> proportions for doing clustering and similarity (though not sure if this is
> strictly required with cosine distance) I probably want to use U * Sigma
> instead of sqrt Sigma.
> 
> Since I have no other reason to load U row by row I could write another
> transform and keep it out of the mahout core but doing this in a patch
> seems trivial. Just create a new flag, something like --uSigma (the CLI
> option looks like the hardest part actually). For the API there needs to be
> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
> extra flag check in the setup for the UJob, something like the following
> 
>      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>        sValues = SSVDHelper.loadVector(sigmaPath,
> context.getConfiguration());
>        // sValues.assign(Functions.SQRT);  // no need to take the sqrt
> for Sigma weighting
>      }
> 
> sValues is already applied to U in the map, which would remain unchanged
> since the sigma weighted (instead of sqrt sigma) values will already be in
> sValues.
> 
>      if (sValues != null) {
>        for (int i = 0; i < k; i++) {
>          uRow.setQuick(i,
>                        qRow.dot(uHat.viewColumn(i)) *
> sValues.getQuick(i));
>        }
>      } else {
>        …
> 
> I'll give this a try and if it seems reasonable submit a patch.
> 
> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> When using PCA it's also preferable to use --uHalfSigma to create U with
> the SSVD solver. One difficulty is that to perform the multiplication you
> have to turn the singular values vector (diagonal values) into a
> distributed row matrix or write your own multiply function, correct?
> 
> You could do that, but why? Sigma is a diagonal matrix (which is
> additionally encoded as a very short vector of singular values of
> length k; say we denote it as 'sv'). Given that, there's absolutely no
> reason to encode it as a distributed row matrix.
> 
> Multiplication can be done on the fly as you load U, row by row:
> U*Sigma[i,j]=U[i,j]*sv[j]
> 
> One inconvenience with that approach is that it assumes you can freely
> hack the code that loads U matrix for further processing.
> 
> It is much easier to have SSVD output U*Sigma directly using the
> same logic as above (requires a patch), or just have it output
> U*Sigma^0.5 (does not require a patch).
> 
> You could even use U in some cases directly, but part of the problem
> is that data variances will be normalized in all directions compared
> to PCA space, which will affect actual distances between data points.
> If you want to retain proportions of the directional variances as in
> your original input, you need to use principal components with scaling
> applied, i.e. U*Sigma.
> 
> 
> 

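The "construct an index for A and use it on U" idea mentioned above amounts to zipping A's row ids, in order, with U's rows. A minimal plain-Java sketch of that pairing (no Mahout or Hadoop classes; the names are hypothetical, and it assumes U's row order really does match A's):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyIndex {
    // Re-attach row ids to U, assuming U's i-th row corresponds to A's i-th row.
    static Map<String, double[]> attachIds(List<String> idsFromA, List<double[]> rowsOfU) {
        if (idsFromA.size() != rowsOfU.size()) {
            throw new IllegalArgumentException("row count mismatch");
        }
        // LinkedHashMap preserves the original row order.
        Map<String, double[]> keyed = new LinkedHashMap<>();
        for (int i = 0; i < idsFromA.size(); i++) {
            keyed.put(idsFromA.get(i), rowsOfU.get(i));
        }
        return keyed;
    }
}
```

If the order assumption does not hold, this index approach breaks silently, which is why getting the keys out of the solver itself is preferable.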

Re: SSVD compute U * Sigma

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes you got it, thats what i was proposing before. A very easy patch.
On Sep 7, 2012 9:11 AM, "Pat Ferrel" <pa...@gmail.com> wrote:

> U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".
>
> WRT using U * Sigma vs. U * Sigma^(1/2) I do want to retain distance
> proportions for doing clustering and similarity (though not sure if this is
> strictly required with cosine distance) I probably want to use U * Sigma
> instead of sqrt Sigma.
>
> Since I have no other reason to load U row by row I could write another
> transform and keep it out of the mahout core but doing this in a patch
> seems trivial. Just create a new flag, something like --uSigma (the CLI
> option looks like the hardest part actually). For the API there needs to be
> a new setter something like SSVDSolver#setComputeUSigma(true) then do an
> extra flag check in the setup for the UJob, something like the following
>
>       if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set
> from --uSigma option or SSVDSolver#setComputeUSigma(true)
>         sValues = SSVDHelper.loadVector(sigmaPath,
> context.getConfiguration());
>         // sValues.assign(Functions.SQRT);  // no need to take the sqrt
> for Sigma weighting
>       }
>
> sValues is already applied to U in the map, which would remain unchanged
> since the sigma weighted (instead of sqrt sigma) values will already be in
> sValues.
>
>       if (sValues != null) {
>         for (int i = 0; i < k; i++) {
>           uRow.setQuick(i,
>                         qRow.dot(uHat.viewColumn(i)) *
> sValues.getQuick(i));
>         }
>       } else {
>         …
>
> I'll give this a try and if it seems reasonable submit a patch.
>
> On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >
> > When using PCA it's also preferable to use --uHalfSigma to create U with
> the SSVD solver. One difficulty is that to perform the multiplication you
> have to turn the singular values vector (diagonal values) into a
> distributed row matrix or write your own multiply function, correct?
>
> You could do that, but why? Sigma is a diagonal matrix (which is
> additionally encoded as a very short vector of singular values of
> length k; say we denote it as 'sv'). Given that, there's absolutely no
> reason to encode it as a distributed row matrix.
>
> Multiplication can be done on the fly as you load U, row by row:
> U*Sigma[i,j]=U[i,j]*sv[j]
>
> One inconvenience with that approach is that it assumes you can freely
> hack the code that loads U matrix for further processing.
>
> It is much easier to have SSVD output U*Sigma directly using the
> same logic as above (requires a patch), or just have it output
> U*Sigma^0.5 (does not require a patch).
>
> You could even use U in some cases directly, but part of the problem
> is that data variances will be normalized in all directions compared
> to PCA space, which will affect actual distances between data points.
> If you want to retain proportions of the directional variances as in
> your original input, you need to use principal components with scaling
> applied, i.e. U*Sigma.
>
>
>

SSVD compute U * Sigma

Posted by Pat Ferrel <pa...@gmail.com>.
U*Sigma[i,j]=U[i,j]*sv[j] is what I meant by "write your own multiply".

WRT using U * Sigma vs. U * Sigma^(1/2): I do want to retain distance proportions for clustering and similarity (though I'm not sure this is strictly required with cosine distance), so I probably want to use U * Sigma instead of sqrt Sigma.

Since I have no other reason to load U row by row, I could write another transform and keep it out of the Mahout core, but doing this in a patch seems trivial. Just create a new flag, something like --uSigma (the CLI option looks like the hardest part, actually). For the API there needs to be a new setter, something like SSVDSolver#setComputeUSigma(true), then an extra flag check in the setup for the UJob, something like the following:

      if (context.getConfiguration().get(PROP_U_SIGMA) != null) { //set from --uSigma option or SSVDSolver#setComputeUSigma(true)
        sValues = SSVDHelper.loadVector(sigmaPath, context.getConfiguration());
        // sValues.assign(Functions.SQRT);  // no need to take the sqrt for Sigma weighting
      }

sValues is already applied to U in the map, which would remain unchanged, since the sigma-weighted (instead of sqrt-sigma) values will already be in sValues.

      if (sValues != null) {
        for (int i = 0; i < k; i++) {
          uRow.setQuick(i,
                        qRow.dot(uHat.viewColumn(i)) * sValues.getQuick(i));
        }
      } else {
	…

I'll give this a try and if it seems reasonable submit a patch.
 
On Sep 6, 2012, at 1:01 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> When using PCA it's also preferable to use --uHalfSigma to create U with the SSVD solver. One difficulty is that to perform the multiplication you have to turn the singular values vector (diagonal values) into a distributed row matrix or write your own multiply function, correct?

You could do that, but why? Sigma is a diagonal matrix (which is
additionally encoded as a very short vector of singular values of
length k; say we denote it as 'sv'). Given that, there's absolutely no
reason to encode it as a distributed row matrix.

Multiplication can be done on the fly as you load U, row by row:
U*Sigma[i,j]=U[i,j]*sv[j]

One inconvenience with that approach is that it assumes you can freely
hack the code that loads U matrix for further processing.

It is much easier to have SSVD output U*Sigma directly using the
same logic as above (requires a patch), or just have it output
U*Sigma^0.5 (does not require a patch).

You could even use U in some cases directly, but part of the problem
is that data variances will be normalized in all directions compared
to PCA space, which will affect actual distances between data points.
If you want to retain proportions of the directional variances as in
your original input, you need to use principal components with scaling
applied, i.e. U*Sigma.



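The patch discussed above boils down to the on-the-fly multiplication U*Sigma[i,j] = U[i,j]*sv[j], i.e. scaling column j of U by the j-th singular value. A plain-Java sketch of just that scaling step (no Mahout classes; the class and method names are hypothetical):

```java
public class SigmaScale {
    // Compute U * Sigma where Sigma = diag(sv): scale column j of U by sv[j].
    // Leaves the input matrix untouched and returns a new matrix.
    static double[][] timesSigma(double[][] u, double[] sv) {
        double[][] us = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            us[i] = new double[u[i].length];
            for (int j = 0; j < u[i].length; j++) {
                us[i][j] = u[i][j] * sv[j];  // U*Sigma[i,j] = U[i,j] * sv[j]
            }
        }
        return us;
    }
}
```

Because Sigma is diagonal, this is O(m*k) work with no matrix-matrix multiply at all, which is why encoding Sigma as a distributed row matrix buys nothing.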
Re: Doing dimensionality reduction with SSVD and Lanczos

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Thu, Sep 6, 2012 at 10:17 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> When using Lanczos the recommendation is to use clean eigenvectors as a distributed row matrix--call it V.
>
> A-hat = A^t V^t this per the clusterdump tests DSVD and DSVD2.

I am not quite sure where this comes from. (For one, A is m x n and V
is n x k, so the product A^t V^t just does not exist.)
PCA space is defined as AV.


>
> Dmitriy and Ted recommend when using SSVD to do:
>
> A-hat = US

Yes, since PCA = AV, and AV is approximately the same as U*Sigma.

SSVD already produces U and Sigma (or U*Sigma^0.5 as a single output
product), so there's no need to compute the AV product again.

>
> When using PCA it's also preferable to use --uHalfSigma to create U with the SSVD solver. One difficulty is that to perform the multiplication you have to turn the singular values vector (diagonal values) into a distributed row matrix or write your own multiply function, correct?

You could do that, but why? Sigma is a diagonal matrix (which is
additionally encoded as a very short vector of singular values of
length k; say we denote it as 'sv'). Given that, there's absolutely no
reason to encode it as a distributed row matrix.

Multiplication can be done on the fly as you load U, row by row:
U*Sigma[i,j]=U[i,j]*sv[j]

One inconvenience with that approach is that it assumes you can freely
hack the code that loads U matrix for further processing.

It is much easier to have SSVD output U*Sigma directly using the
same logic as above (requires a patch), or just have it output
U*Sigma^0.5 (does not require a patch).

You could even use U in some cases directly, but part of the problem
is that data variances will be normalized in all directions compared
to PCA space, which will affect actual distances between data points.
If you want to retain proportions of the directional variances as in
your original input, you need to use principal components with scaling
applied, i.e. U*Sigma.

>
> Questions:
> For SSVD can someone explain why US is preferred? Given A = USV^t how can you ignore the effect of V^t? Is this only for PCA? In other words if you did not use PCA weighting would you ignore V^t?


US is not preferred (preferred over what?). It is the approximation of
the definition of PCA, which is (A-M)V (assuming row-wise data points)
or U^t(A-M) (assuming columnar data points). But since SSVD --pca deals
specifically with row-wise data points (it uses column mean
subtraction), what we really have is the PCA equivalent (A-M)V. Since
(A-M) approx. = U Sigma V^t, it follows that (A-M)V approx. = U Sigma.
V^t is not ignored: multiplying by V on the right cancels it, because
V^t V = I.
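Written out step by step, the argument is:

```latex
% PCA of row-wise data points: project the mean-centered A onto V
\hat{A} = (A - M)V
% SSVD with --pca computes the rank-k SVD of the mean-centered matrix:
(A - M) \approx U \Sigma V^{\mathsf{T}}
% Substituting, and using V^{\mathsf{T}} V = I (V has orthonormal columns):
(A - M)V \approx U \Sigma V^{\mathsf{T}} V = U \Sigma
```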

The rest of this question I don't understand.

> For Lanczos A-hat = A^t V^t seems to strip doc id during transpose, am I mistaken? Also shouldn't A-hat be transposed before performing kmeans or other analysis?
>
>
>
>> Dmitriy said
> With SSVD you need just US  (or U*Sigma in other notation).
> This is your dimensionally reduced output of your original document
> matrix you've run with --pca option.
>
> As Ted suggests, you may also use US^0.5 which is already produced by
> providing --uHalfSigma (or its embedded setter analog). the keys of
> that output (produced by getUPath() call) will already contain your
> Text document ids as sequence file keys.
>
> -d
>
>