Posted to user@mahout.apache.org by Pat Ferrel <pa...@gmail.com> on 2012/08/10 01:34:54 UTC

SSVD for dimensional reduction + Kmeans

Quoth Grant Ingersoll: 
> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in: 
> 
> bin/mahout svd (original -> svdOut) 
> bin/mahout cleansvd ... 
> bin/mahout transpose svdOut -> svdT 
> bin/mahout transpose original -> originalT 
> bin/mahout matrixmult originalT svdT -> newMatrix 
> bin/mahout kmeans newMatrix 

I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors? If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob? I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
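
(For reference: a minimal sketch of what the SSVD-based version of that pipeline can look like inside a test, assuming the 0.7-era SSVDSolver API in org.apache.mahout.math.hadoop.stochasticsvd — the constructor and accessor signatures below are from memory and should be treated as assumptions. As the replies below confirm, SSVD needs no EigenVerificationJob pass, so U can feed kmeans directly.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

Configuration conf = new Configuration();
Path input = new Path("testdata/matrix");   // hypothetical input DRM
Path output = new Path("output/ssvd");      // hypothetical output dir
int k = 10, p = 15, ablockRows = 30000, reduceTasks = 1;

SSVDSolver solver =
    new SSVDSolver(conf, new Path[] {input}, output, ablockRows, k, p, reduceTasks);
solver.setQ(1);            // one power iteration for better precision
solver.setComputeU(true);  // rows of U are the reduced document vectors
solver.setComputeV(false); // V is not needed for clustering rows
solver.setOverwrite(true);
solver.run();

// the m x k matrix under this path replaces "newMatrix" for kmeans
Path reduced = new Path(solver.getUPath());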

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
This all stems from the fact that rank(thin SVD) <= rank(A). Since
the thin SVD rank is really k+p, and rank(A) <= min(m,n), it follows
that k+p must be <= min(m,n). Obviously in big data m and n are always
high, so it is not a problem for SSVD.

And if m and n are of the same order as the desired rank, then the
problem is either unsolvable (if large) or solvable in memory (if small).
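
(A tiny numeric illustration of that bound, with made-up test dimensions:)

// hypothetical tiny-test dimensions: m docs by n terms
int m = 30, n = 1000;
int k = 10, p = 15;  // k+p = 25 <= min(30, 1000), so a thin SVD of rank k+p exists
if (k + p > Math.min(m, n)) {
  throw new IllegalArgumentException("k+p exceeds the rank bound min(m,n)");
}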

On Fri, Aug 10, 2012 at 11:22 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> I guess there's one more clarification that might be useful.
>
> SSVD in fact creates a decomposition of rank (k+p), where p is called
> "oversampling": it captures more plausible high-variance dimensions of
> the data in case we guessed the first k wrong. It then throws away the
> last p singular values and vectors. That allows reducing rounding errors
> due to an imperfectly guessed projection.
>
> In the corner case when k+p=m, we get a full-rank SVD.
>
> The assumption of SSVD is that m >> (k+p), but it will still work as
> long as (k+p) <= m, including full-rank decomposition when p=0 and
> k=m. (It assumes n >> k+p too.)
>
> -d
>
> On Fri, Aug 10, 2012 at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> It happens because of internal constraints stemming from blocking: it
>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>> point blocks are too small (or rather, too short) to successfully
>> perform a QR on.
>>
>> This also means, among other things, k+p cannot be more than your
>> total number of rows in the input.
>>
>> It is also possible that input A is way too wide or k+p is way too big,
>> so that an arbitrary split does not fetch at least k+p rows of A, but
>> i haven't seen such cases in practice yet. If that
>> happens, there's an option to increase minSplitSize (which would
>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>> not your case.
>>
>> But if your input is shorter than k+p, then it is a case too small for
>> SSVD. In fact, it probably means you can solve the test directly in memory
>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>> this case and get an exact (non-reduced rank) decomposition equivalent
>> with no stochastic effects, but that is not really what it is for.
>>
>> Assuming your input is m x n, can you tell me please what your m, n, k
>> and p are?
>>
>> thanks.
>> -D
>>
>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>
>>>         java.lang.IllegalArgumentException: new m can't be less than n
>>>                 at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>
>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>
>>>         int p = 15; //default value for CLI
>>>         int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>
>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>
>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>
>>>
>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Reading "overview and usage" doc linked on that page
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>> should help to clarify outputs and usage.
>>>
>>>
>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> Quoth Grant Ingersoll:
>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>
>>>>>> bin/mahout svd (original -> svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>> bin/mahout transpose original -> originalT
>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>
>>>> No
>>>>
>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>> no
>>>>
>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>>
>>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I guess there's one more clarification that might be useful.

SSVD in fact creates a decomposition of rank (k+p), where p is called
"oversampling": it captures more plausible high-variance dimensions of
the data in case we guessed the first k wrong. It then throws away the
last p singular values and vectors. That allows reducing rounding errors
due to an imperfectly guessed projection.

In the corner case when k+p=m, we get a full-rank SVD.

The assumption of SSVD is that m >> (k+p), but it will still work as
long as (k+p) <= m, including full-rank decomposition when p=0 and
k=m. (It assumes n >> k+p too.)

-d
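
(A short sketch of the oversampling-and-truncation idea with illustrative numbers; k and p here are arbitrary:)

int k = 200, p = 15;
int internalRank = k + p;  // 215 singular triples are computed internally
// after the solve, the last p triples are discarded:
double[] sigmaInternal = new double[internalRank];           // stands in for the internal result
double[] sigma = java.util.Arrays.copyOf(sigmaInternal, k);  // only the first k are returned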

On Fri, Aug 10, 2012 at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> It happens because of internal constraints stemming from blocking: it
> happens when a split of A (the input) has fewer than (k+p) rows, at which
> point blocks are too small (or rather, too short) to successfully
> perform a QR on.
>
> This also means, among other things, k+p cannot be more than your
> total number of rows in the input.
>
> It is also possible that input A is way too wide or k+p is way too big,
> so that an arbitrary split does not fetch at least k+p rows of A, but
> i haven't seen such cases in practice yet. If that
> happens, there's an option to increase minSplitSize (which would
> undermine MR mapper efficiency somewhat). But i am pretty sure it is
> not your case.
>
> But if your input is shorter than k+p, then it is a case too small for
> SSVD. In fact, it probably means you can solve the test directly in memory
> with any solver. You can still use SSVD with k=m and p=0 (I think) in
> this case and get an exact (non-reduced rank) decomposition equivalent
> with no stochastic effects, but that is not really what it is for.
>
> Assuming your input is m x n, can you tell me please what your m, n, k
> and p are?
>
> thanks.
> -D
>
> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>
>>         java.lang.IllegalArgumentException: new m can't be less than n
>>                 at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>
>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>
>>         int p = 15; //default value for CLI
>>         int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>
>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>
>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>
>>
>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> Reading "overview and usage" doc linked on that page
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>> should help to clarify outputs and usage.
>>
>>
>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>> Quoth Grant Ingersoll:
>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>
>>>>> bin/mahout svd (original -> svdOut)
>>>>> bin/mahout cleansvd ...
>>>>> bin/mahout transpose svdOut -> svdT
>>>>> bin/mahout transpose original -> originalT
>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>> bin/mahout kmeans newMatrix
>>>>
>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>
>>> No
>>>
>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>> no
>>>
>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>
>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I guess some strategy like this will work for a small-size test:

k = ...
p = ...
m = ...  // external knowledge: the total number of rows of the input

// clamp so that k+p <= m
if ( k + p > m ) {
  p = m - k;        // give up oversampling first
  if ( p < 0 ) {    // k alone already exceeds m
    k += p;         // i.e. k = m
    p = 0;
  }
}

(This is presumably also why the desiredRank = sampleData.size() - p - 1 guess worked: it leaves k+p = m-1, within the k+p <= m bound.)
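
(Applying the clamp to numbers like the tiny test's — the m here is hypothetical — it degenerates to the exact corner case:)

int m = 20;           // tiny corpus
int k = 200, p = 15;  // CLI-style defaults, far too big for this m
if (k + p > m) {
  p = m - k;          // p = -180
  if (p < 0) {
    k += p;           // k = 20 = m
    p = 0;
  }
}
// result: k = 20, p = 0 -- the full-rank, no-oversampling corner case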


On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
> not a method peculiarity.
>
> The only reason the solution doesn't warn you explicitly is that the
> DistributedRowMatrix format, which is just a sequence file of rows,
> does not provide us with an easy way to verify what m actually is
> before the job actually iterates over it and runs into block size
> deficiency. So if you know m as external knowledge, it is easy to
> avoid being trapped by block height deficiency.
>
>
> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pa...@gmail.com> wrote:
>> This is only a test with some trivially simple data. I doubt there are any splits and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2, which is in
>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>> I've attached the modified and running version with testKmeansDSSVD
>>
>> As I said I don't think this is a real world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought if there was an easy answer I could clean up the parameters for SSVDSolver.
>>
>> Since it is working I don't know that it's worth the effort unless people are likely to run into this with larger data sets.
>>
>> Thanks anyway.
>>
>>
>>
>>
>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> It happens because of internal constraints stemming from blocking: it
>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>> point blocks are too small (or rather, too short) to successfully
>> perform a QR on.
>>
>> This also means, among other things, k+p cannot be more than your
>> total number of rows in the input.
>>
>> It is also possible that input A is way too wide or k+p is way too big,
>> so that an arbitrary split does not fetch at least k+p rows of A, but
>> i haven't seen such cases in practice yet. If that
>> happens, there's an option to increase minSplitSize (which would
>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>> not your case.
>>
>> But if your input is shorter than k+p, then it is a case too small for
>> SSVD. In fact, it probably means you can solve the test directly in memory
>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>> this case and get an exact (non-reduced rank) decomposition equivalent
>> with no stochastic effects, but that is not really what it is for.
>>
>> Assuming your input is m x n, can you tell me please what your m, n, k
>> and p are?
>>
>> thanks.
>> -D
>>
>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>
>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>         at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>
>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>
>>> int p = 15; //default value for CLI
>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>
>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>
>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>
>>>
>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> Reading "overview and usage" doc linked on that page
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>> should help to clarify outputs and usage.
>>>
>>>
>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> Quoth Grant Ingersoll:
>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>
>>>>>> bin/mahout svd (original -> svdOut)
>>>>>> bin/mahout cleansvd ...
>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>> bin/mahout transpose original -> originalT
>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>> bin/mahout kmeans newMatrix
>>>>>
>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>
>>>> No
>>>>
>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>> no
>>>>
>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>>
>>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>>
>>
>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Aug 10, 2012 at 2:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> another piece of information is that according to an investigation we
> did, using q=0 seems to underestimate singular values somewhat and
> produces a little better decay than the optimal values would have, which
> means variance retained would err on the optimistic side a bit
> (showing a little bit more variance retained than actual).
>
> also i think retaining 99% is reasonable with big corpuses. this

sorry i meant "retaining 99% of data variance during dimensionality
reduction is probably unreasonable with big corpuses".

> desirable threshold could probably be reduced a little bit to
> accommodate pragmatic resource capacity.
>
> On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> note that to estimate variance retained only approximately, you
>> probably don't need to run it with q=1, so you can run with q=0 and not
>> request either V or U but singular values only. That will reduce
>> running time dramatically (perhaps up to 2-5 times faster compared to
>> q=1 with U and V).
>>
>> On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> With PCA, there's a metric, something called "variance retained".
>>>
>>> One idea of mine for estimating k is described in the footnote discussion
>>> on page 5. While it is not possible to compute the "PCA variance retained"
>>> metric exactly with an application of a thin SVD (the metric assumes
>>> use of a full-rank SVD), it is possible to infer an upper estimate for k
>>> given a target variance retained (say, 99%), or to try some sort of
>>> polynomially approximated value for the sum of all singular values
>>> given the visible decay. Probably requires some simple code in R or matlab
>>> to get a reasonable estimate.
>>>
>>> This technique requires running PCA once and then estimating a
>>> sufficient k given the singular values produced on your corpus. If the
>>> action is repetitive and the corpus is not changing drastically, you may
>>> infer whether you are spending too much (or too little) on k for future uses.
>>>
>>> But pragmatically i just use the best k my cluster can compute in the
>>> time i need. my corpus is relatively small and i don't do full corpus
>>> runs too often, so i can afford some time spent.
>>>
>>> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>> The built-in PCA option is one reason I wanted to try SSVD. Building the test was to make sure I understood the external matrix operations before diving in. I expect one primary decision is how to choose k for reduction. I'm hoping to get some noise rejection, so I'm not using it so much to reduce matrix size. We are starting with m = 500,000 and a million or so docs. We get many dups and low-value docs in a small web crawl, so lots of noise.
>>>>
>>>> You mention in your paper:
>>>>
>>>> "The value of k + p directly impacts running time and memory requirements.
>>>> k+p=500 is probably more than reasonable. Typically k + p
>>>> is taken within range 20…200"
>>>>
>>>> So I guess we might start with
>>>>         -p 15 (default)
>>>>         -q 1
>>>>         -k 200
>>>>
>>>> Is there any use in hand-inspecting the eigenvectors before choosing the final k? If so, do you get those by choosing k nearly = m, or is something like k = 1000 (or ?) good enough for inspection?
>>>>
>>>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> BTW if you really are trying to reduce dimensionality, you may want to
>>>> consider the --pca option with SSVD, which [i think] will provide much
>>>> better preserved data variance than just clean SVD (i.e. essentially
>>>> run a PCA space transformation on your data rather than just SVD)
>>>>
>>>> -d
>>>>
>>>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> Got it. Well on to some real and much larger data sets then…
>>>>>
>>>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> i think actually Mahout's Lanczos requires external knowledge of input
>>>>> size too, in part for similar reasons. SSVD doesn't, because it doesn't
>>>>> have "other" reasons to know the input size. The fundamental assumption
>>>>> rank(input)>=rank(thin SVD) still stands about the input, but the
>>>>> method doesn't have a goal of verifying it explicitly (which would be
>>>>> kind of hard), and instead either produces 0 eigenvectors or runs into
>>>>> block deficiency.
>>>>>
>>>>> It is however hard to assert whether the block deficiency stemmed from
>>>>> input size deficiency vs. split size deficiency, and neither situation
>>>>> is typical for real-life SSVD applications, hence the error
>>>>> message is somewhat vague.
>>>>>
>>>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
>>>>>> not a method peculiarity.
>>>>>>
>>>>>> The only reason the solution doesn't warn you explicitly is that the
>>>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>>>> does not provide us with an easy way to verify what m actually is
>>>>>> before the job actually iterates over it and runs into block size
>>>>>> deficiency. So if you know m as external knowledge, it is easy to
>>>>>> avoid being trapped by block height deficiency.
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>> This is only a test with some trivially simple data. I doubt there are any splits and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>>>> I've attached the modified and running version with testKmeansDSSVD
>>>>>>>
>>>>>>> As I said I don't think this is a real world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought if there was an easy answer I could clean up the parameters for SSVDSolver.
>>>>>>>
>>>>>>> Since it is working I don't know that it's worth the effort unless people are likely to run into this with larger data sets.
>>>>>>>
>>>>>>> Thanks anyway.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>>
>>>>>>> It happens because of internal constraints stemming from blocking: it
>>>>>>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>>>>>>> point blocks are too small (or rather, too short) to successfully
>>>>>>> perform a QR on.
>>>>>>>
>>>>>>> This also means, among other things, k+p cannot be more than your
>>>>>>> total number of rows in the input.
>>>>>>>
>>>>>>> It is also possible that input A is way too wide or k+p is way too big,
>>>>>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>>>>>> i haven't seen such cases in practice yet. If that
>>>>>>> happens, there's an option to increase minSplitSize (which would
>>>>>>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>>>>>>> not your case.
>>>>>>>
>>>>>>> But if your input is shorter than k+p, then it is a case too small for
>>>>>>> SSVD. In fact, it probably means you can solve the test directly in memory
>>>>>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>>>>>>> this case and get an exact (non-reduced rank) decomposition equivalent
>>>>>>> with no stochastic effects, but that is not really what it is for.
>>>>>>>
>>>>>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>>>>>> and p are?
>>>>>>>
>>>>>>> thanks.
>>>>>>> -D
>>>>>>>
>>>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>>>>>>
>>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>>       at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>>
>>>>>>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>>>>>>
>>>>>>>> int p = 15; //default value for CLI
>>>>>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>>>>>>
>>>>>>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>>>>>>
>>>>>>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Reading "overview and usage" doc linked on that page
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>>>> should help to clarify outputs and usage.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>>>>>>
>>>>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>>>
>>>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>>>>>
>>>>>>>>> No
>>>>>>>>>
>>>>>>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>>>>>>> no
>>>>>>>>>
>>>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>>>>>>>
>>>>>>>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
another piece of information is that according to an investigation we
did, using q=0 seems to underestimate singular values somewhat and
produces a little better decay than the optimal values would have, which
means variance retained would err on the optimistic side a bit
(showing a little bit more variance retained than actual).

also i think retaining 99% is reasonable with big corpuses. this
desirable threshold could probably be reduced a little bit to
accommodate pragmatic resource capacity.

On Fri, Aug 10, 2012 at 2:42 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> note that to estimate variance retained only approximately, you
> probably don't need to run it with q=1, so you can run with q=0 and not
> request either V or U but singular values only. That will reduce
> running time dramatically (perhaps up to 2-5 times faster compared to
> q=1 with U and V).
>
> On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> With PCA, there's a metric, something called "variance retained".
>>
>> One idea of mine for estimating k is described in the footnote discussion
>> on page 5. While it is not possible to compute the "PCA variance retained"
>> metric exactly with an application of a thin SVD (the metric assumes
>> use of a full-rank SVD), it is possible to infer an upper estimate for k
>> given a target variance retained (say, 99%), or to try some sort of
>> polynomially approximated value for the sum of all singular values
>> given the visible decay. Probably requires some simple code in R or matlab
>> to get a reasonable estimate.
>>
>> This technique requires running PCA once and then estimating a
>> sufficient k given the singular values produced on your corpus. If the
>> action is repetitive and the corpus is not changing drastically, you may
>> infer whether you are spending too much (or too little) on k for future uses.
>>
>> But pragmatically i just use the best k my cluster can compute in the
>> time i need. my corpus is relatively small and i don't do full corpus
>> runs too often, so i can afford some time spent.
>>
>> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>> The built-in PCA option is one reason I wanted to try SSVD. Building the test was to make sure I understood the external matrix operations before diving in. I expect one primary decision is how to choose k for reduction. I'm hoping to get some noise rejection, so I'm not using it so much to reduce matrix size. We are starting with m = 500,000 and a million or so docs. We get many dups and low-value docs in a small web crawl, so lots of noise.
>>>
>>> You mention in your paper:
>>>
>>> "The value of k + p directly impacts running time and memory requirements.
>>> k+p=500 is probably more than reasonable. Typically k + p
>>> is taken within range 20…200"
>>>
>>> So I guess we might start with
>>>         -p 15 (default)
>>>         -q 1
>>>         -k 200
>>>
>>> Is there any use in hand-inspecting the eigenvectors before choosing the final k? If so, do you get those by choosing k nearly = m, or is something like k = 1000 (or ?) good enough for inspection?
>>>
>>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> BTW if you really are trying to reduce dimensionality, you may want to
>>> consider the --pca option with SSVD, which [i think] will provide much
>>> better preserved data variance than just clean SVD (i.e. essentially
>>> run a PCA space transformation on your data rather than just SVD)
>>>
>>> -d
>>>
>>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>> Got it. Well on to some real and much larger data sets then…
>>>>
>>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> i think actually Mahout's Lanczos requires external knowledge of input
>>>> size too, in part for similar reasons. SSVD doesn't, because it doesn't
>>>> have "other" reasons to know the input size. The fundamental assumption
>>>> rank(input)>=rank(thin SVD) still stands about the input, but the
>>>> method doesn't have a goal of verifying it explicitly (which would be
>>>> kind of hard), and instead either produces 0 eigenvectors or runs into
>>>> block deficiency.
>>>>
>>>> It is however hard to assert whether the block deficiency stemmed from
>>>> input size deficiency vs. split size deficiency, and neither situation
>>>> is typical for real-life SSVD applications, hence the error
>>>> message is somewhat vague.
>>>>
>>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
>>>>> not a method peculiarity.
>>>>>
>>>>> The only reason the solution doesn't warn you explicitly is that the
>>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>>> does not provide us with an easy way to verify what m actually is
>>>>> before the job actually iterates over it and runs into block size
>>>>> deficiency. So if you know m as external knowledge, it is easy to
>>>>> avoid being trapped by block height deficiency.
>>>>>
>>>>>
>>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>> This is only a test with some trivially simple data. I doubt there are any splits and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>>> I've attached the modified and running version with testKmeansDSSVD
>>>>>>
>>>>>> As I said I don't think this is a real world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought if there was an easy answer I could clean up the parameters for SSVDSolver.
>>>>>>
>>>>>> Since it is working I don't know that it's worth the effort unless people are likely to run into this with larger data sets.
>>>>>>
>>>>>> Thanks anyway.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>
>>>>>> It happens because of internal constraints stemming from blocking: it
>>>>>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>>>>>> point blocks are too small (or rather, too short) to successfully
>>>>>> perform a QR on.
>>>>>>
>>>>>> This also means, among other things, k+p cannot be more than your
>>>>>> total number of rows in the input.
>>>>>>
>>>>>> It is also possible that input A is way too wide or k+p is way too big,
>>>>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>>>>> i haven't seen such cases in practice yet. If that
>>>>>> happens, there's an option to increase minSplitSize (which would
>>>>>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>>>>>> not your case.
>>>>>>
>>>>>> But if your input is shorter than k+p, then it is a case too small for
>>>>>> SSVD. In fact, it probably means you can solve the test directly in memory
>>>>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>>>>>> this case and get an exact (non-reduced rank) decomposition equivalent
>>>>>> with no stochastic effects, but that is not really what it is for.
>>>>>>
>>>>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>>>>> and p are?
>>>>>>
>>>>>> thanks.
>>>>>> -D
>>>>>>
>>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>>>>>
>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>       at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>
>>>>>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>>>>>
>>>>>>> int p = 15; //default value for CLI
>>>>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>>>>>
>>>>>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>>>>>
>>>>>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>>>>>
>>>>>>>
>>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>>
>>>>>>> Reading "overview and usage" doc linked on that page
>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>>> should help to clarify outputs and usage.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>>>>>
>>>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>>
>>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>>>>
>>>>>>>> No
>>>>>>>>
>>>>>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>>>>>> no
>>>>>>>>
>>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>>>>>>
>>>>>>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
note that to estimate variance retained only approximately, you
probably don't need to run it with q=1, so you can run with q=0 and not
request either V or U but singular values only. That will reduce
running time dramatically (perhaps up to 2-5 times faster compared to
q=1 with U and V).
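
(A sketch of that cheap singular-values-only run, again assuming the 0.7-era SSVDSolver API; the getSingularValues accessor in particular is an assumption to verify against your Mahout version:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

Configuration conf = new Configuration();
int k = 200, p = 15, ablockRows = 30000, reduceTasks = 1;
SSVDSolver svOnly = new SSVDSolver(conf,
    new Path[] {new Path("testdata/matrix")},  // hypothetical input
    new Path("output/ssvd-sv"), ablockRows, k, p, reduceTasks);
svOnly.setQ(0);             // no power iterations: fastest, rough estimate
svOnly.setComputeU(false);  // skip U
svOnly.setComputeV(false);  // skip V
svOnly.run();
double[] sigma = svOnly.getSingularValues();  // assumed accessor; inspect the decay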

On Fri, Aug 10, 2012 at 2:31 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> With PCA, there's a metric, something called "variance retained".
>
> One idea of mine for estimating k is described in the footnote discussion
> on page 5. While it is not possible to compute the "PCA variance retained"
> metric exactly with an application of a thin SVD (the metric assumes
> use of a full-rank SVD), it is possible to infer an upper estimate for k
> given a target variance retained (say, 99%), or to try some sort of
> polynomially approximated value for the sum of all singular values
> given the visible decay. Probably requires some simple code in R or matlab
> to get a reasonable estimate.
>
> This technique requires running PCA once and then estimating a
> sufficient k given the singular values produced on your corpus. If the
> action is repetitive and the corpus is not changing drastically, you may
> infer whether you are spending too much (or too little) on k for future uses.
>
> But pragmatically i just use the best k my cluster can compute in the
> time i need. my corpus is relatively small and i don't do full corpus
> runs too often, so i can afford some time spent.
>
> On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> The built-in PCA option is one reason I wanted to try SSVD. Building the test was to make sure I understood the external matrix operations before diving in. I expect one primary decision is how to choose k for reduction. I'm hoping to get some noise rejection, so I'm not using it so much to reduce matrix size. We are starting with m = 500,000 and a million or so docs. We get many dups and low-value docs in a small web crawl, so lots of noise.
>>
>> You mention in your paper:
>>
>> "The value of k + p directly impacts running time and memory requirements.
>> k+p=500 is probably more than reasonable. Typically k + p
>> is taken within range 20…200"
>>
>> So I guess we might start with
>>         -p 15 (default)
>>         -q 1
>>         -k 200
>>
>> Is there any use in hand-inspecting the eigenvectors before choosing the final k? If so, do you get those by choosing k nearly = m, or is something like k = 1000 (or ?) good enough for inspection?
>>
>> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> BTW if you really are trying to reduce dimensionality, you may want to
>> consider the --pca option with SSVD, which [i think] will provide much
>> better preserved data variance than just clean SVD (i.e. essentially
>> run a PCA space transformation on your data rather than just SVD)
>>
>> -d
>>
>> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>> Got it. Well on to some real and much larger data sets then…
>>>
>>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>
>>> i think actually Mahout's Lanczos requires external knowledge of input
>>> size too, in part for similar reasons. SSVD doesn't, because it doesn't
>>> have "other" reasons to know the input size. The fundamental assumption
>>> rank(input)>=rank(thin SVD) still stands about the input, but the
>>> method doesn't have a goal of verifying it explicitly (which would be
>>> kind of hard), and instead either produces 0 eigenvectors or runs into
>>> block deficiency.
>>>
>>> It is however hard to assert whether the block deficiency stemmed from
>>> input size deficiency vs. split size deficiency, and neither situation
>>> is typical for real-life SSVD applications, hence the error
>>> message is somewhat vague.
>>>
>>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
>>>> not a method peculiarity.
>>>>
>>>> The only reason the solution doesn't warn you explicitly is that the
>>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>>> does not provide us with an easy way to verify what m actually is
>>>> before the job actually iterates over it and runs into block size
>>>> deficiency. So if you know m as external knowledge, it is easy to
>>>> avoid being trapped by block height deficiency.
>>>>
>>>>
>>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> This is only a test with some trivially simple data. I doubt there are any splits and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>>> I've attached the modified and running version with testKmeansDSSVD
>>>>>
>>>>> As I said I don't think this is a real world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought if there was an easy answer I could clean up the parameters for SSVDSolver.
>>>>>
>>>>> Since it is working I don't know that it's worth the effort unless people are likely to run into this with larger data sets.
>>>>>
>>>>> Thanks anyway.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> It happens because of internal constraints stemming from blocking: it
>>>>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>>>>> point blocks are too small (or rather, too short) to successfully
>>>>> perform a QR on.
>>>>>
>>>>> This also means, among other things, k+p cannot be more than your
>>>>> total number of rows in the input.
>>>>>
>>>>> It is also possible that input A is way too wide or k+p is way too big,
>>>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>>>> i haven't seen such cases in practice yet. If that
>>>>> happens, there's an option to increase minSplitSize (which would
>>>>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>>>>> not your case.
>>>>>
>>>>> But if your input is shorter than k+p, then it is a case too small for
>>>>> SSVD. In fact, it probably means you can solve the test directly in memory
>>>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>>>>> this case and get an exact (non-reduced rank) decomposition equivalent
>>>>> with no stochastic effects, but that is not really what it is for.
>>>>>
>>>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>>>> and p are?
>>>>>
>>>>> thanks.
>>>>> -D
>>>>>
>>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>>>>
>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>       at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>
>>>>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>>>>
>>>>>> int p = 15; //default value for CLI
>>>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>>>>
>>>>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>>>>
>>>>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>>>>
>>>>>>
>>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>
>>>>>> Reading "overview and usage" doc linked on that page
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>>> should help to clarify outputs and usage.
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>>> Quoth Grant Ingersoll:
>>>>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>>>>
>>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>>> bin/mahout cleansvd ...
>>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>>
>>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>>>
>>>>>>> No
>>>>>>>
>>>>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>>>>> no
>>>>>>>
>>>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean them
>>>>>>>
>>>>>>>> I get errors when I do so; I'm trying to figure out if I'm on the wrong track.
>>>>>>
>>>>>
>>>>>
>>>
>>

Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
With PCA, there's a metric, something called "variance retained".

One idea of mine for estimating k is described in the footnote discussion
on page 5. While it is not possible to compute the "PCA variance retained"
metric exactly with an application of a thin SVD (the metric assumes
use of a full-rank SVD), it is possible to infer an upper estimate for k
given a target variance retained (say, 99%), or to try some sort of
polynomially approximated value for the sum of all singular values
given the visible decay. Probably requires some simple code in R or matlab
to get a reasonable estimate.

This technique requires running PCA once and then estimating a
sufficient k given the singular values produced on your corpus. If the
action is repetitive and the corpus is not changing drastically, you may
infer whether you are spending too much (or too little) on k for future uses.

But pragmatically i just use the best k my cluster can compute in the
time i need. my corpus is relatively small and i don't do full corpus
runs too often, so i can afford some time spent.
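
(A minimal sketch of that estimate. It assumes sigma comes from one q=0 run and that the total variance — the sum of ALL squared singular values — has been approximated externally, e.g. by extrapolating the visible decay as described above; variance retained by the first k components is then a fraction of that total:)

// smallest k whose components retain >= target of the (approximated) total variance
static int estimateK(double[] sigma, double totalVariance, double target) {
  double cum = 0;
  for (int i = 0; i < sigma.length; i++) {
    cum += sigma[i] * sigma[i];  // variance captured by component i
    if (cum / totalVariance >= target) {
      return i + 1;
    }
  }
  return sigma.length;  // target not reachable within the thin rank
}

// e.g. estimateK(sigma, approxTotalVariance, 0.99)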

On Fri, Aug 10, 2012 at 2:14 PM, Pat Ferrel <pa...@gmail.com> wrote:
> The built-in PCA option is one reason I wanted to try SSVD. Building the test was to make sure I understood the external matrix operations before diving in. I expect one primary decision is how to choose k for reduction. I'm hoping to get some noise rejection, so I'm not using it so much to reduce matrix size. We are starting with m = 500,000 and a million or so docs. We get many dups and low-value docs in a small web crawl, so lots of noise.
>
> You mention in your paper:
>
> "The value of k + p directly impacts running time and memory requirements.
> k+p=500 is probably more than reasonable. Typically k + p
> is taken within range 20…200"
>
> So I guess we might start with
>         -p 15 (default)
>         -q 1
>         -k 200
>
> Is there any use in hand-inspecting the eigenvectors before choosing the final k? If so, do you get those by choosing k nearly = m, or is something like k = 1000 (or ?) good enough for inspection?
>
> On Aug 10, 2012, at 12:53 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> BTW if you really are trying to reduce dimensionality, you may want to
> consider the --pca option with SSVD, which [i think] will provide much
> better preserved data variance than just clean SVD (i.e. essentially
> run a PCA space transformation on your data rather than just SVD)
>
> -d
>
> On Fri, Aug 10, 2012 at 11:57 AM, Pat Ferrel <pa...@gmail.com> wrote:
>> Got it. Well on to some real and much larger data sets then…
>>
>> On Aug 10, 2012, at 11:53 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> i think actually Mahout's Lanczos requires external knowledge of input
>> size too, in part for similar reasons. SSVD doesn't, because it doesn't
>> have "other" reasons to know the input size. The fundamental assumption
>> rank(input)>=rank(thin SVD) still stands about the input, but the
>> method doesn't have a goal of verifying it explicitly (which would be
>> kind of hard), and instead either produces 0 eigenvectors or runs into
>> block deficiency.
>>
>> It is however hard to assert whether the block deficiency stemmed from
>> input size deficiency vs. split size deficiency, and neither situation
>> is typical for real-life SSVD applications, hence the error
>> message is somewhat vague.
>>
>> On Fri, Aug 10, 2012 at 11:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
>>> not a method peculiarity.
>>>
>>> The only reason the solution doesn't warn you explicitly is that the
>>> DistributedRowMatrix format, which is just a sequence file of rows,
>>> does not provide us with an easy way to verify what m actually is
>>> before the job actually iterates over it and runs into block size
>>> deficiency. So if you know m as external knowledge, it is easy to
>>> avoid being trapped by block height deficiency.
>>>
>>>
>>> On Fri, Aug 10, 2012 at 11:32 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>> This is only a test with some trivially simple data. I doubt there are any splits and yes it could easily be done in memory but that is not the purpose. It is based on testKmeansDSVD2, which is in
>>>> mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
>>>> I've attached the modified and running version with testKmeansDSSVD
>>>>
>>>> As I said I don't think this is a real world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought if there was an easy answer I could clean up the parameters for SSVDSolver.
>>>>
>>>> Since it is working I don't know that it's worth the effort unless people are likely to run into this with larger data sets.
>>>>
>>>> Thanks anyway.
>>>>
>>>>
>>>>
>>>>
>>>> On Aug 10, 2012, at 11:07 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>
>>>> It happens because of internal constraints stemming from blocking: it
>>>> happens when a split of A (the input) has fewer than (k+p) rows, at which
>>>> point blocks are too small (or rather, too short) to successfully
>>>> perform a QR on.
>>>>
>>>> This also means, among other things, k+p cannot be more than your
>>>> total number of rows in the input.
>>>>
>>>> It is also possible that input A is way too wide or k+p is way too big,
>>>> so that an arbitrary split does not fetch at least k+p rows of A, but
>>>> i haven't seen such cases in practice yet. If that
>>>> happens, there's an option to increase minSplitSize (which would
>>>> undermine MR mapper efficiency somewhat). But i am pretty sure it is
>>>> not your case.
>>>>
>>>> But if your input is shorter than k+p, then it is a case too small for
>>>> SSVD. In fact, it probably means you can solve the test directly in memory
>>>> with any solver. You can still use SSVD with k=m and p=0 (I think) in
>>>> this case and get an exact (non-reduced rank) decomposition equivalent
>>>> with no stochastic effects, but that is not really what it is for.
>>>>
>>>> Assuming your input is m x n, can you tell me please what your m, n, k
>>>> and p are?
>>>>
>>>> thanks.
>>>> -D
>>>>
>>>> On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>> There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and choosing the wrong k makes it very easy to get a failure with this message:
>>>>>
>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>       at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>
>>>>> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>>>>>
>>>>> int p = 15; //default value for CLI
>>>>> int desiredRank = sampleData.size() - p - 1;//number of docs - p - 1, ?????? not sure why this works
>>>>>
>>>>> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>>>>>
>>>>> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>>>>>
>>>>>
>>>>> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>
>>>>> Reading "overview and usage" doc linked on that page
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
>>>>> should help to clarify outputs and usage.
>>>>>
>>>>>
>>>>> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>>>>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>>>>>> Quoth Grant Ingersoll:
>>>>>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>>>>>
>>>>>>>> bin/mahout svd (original -> svdOut)
>>>>>>>> bin/mahout cleansvd ...
>>>>>>>> bin/mahout transpose svdOut -> svdT
>>>>>>>> bin/mahout transpose original -> originalT
>>>>>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>>>>>> bin/mahout kmeans newMatrix
>>>>>>>
>>>>>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>>>>>
>>>>>> No
>>>>>>
>>>>>>> If so, where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V, so should I transpose V* and then run it through the EigenVerificationJob?
>>>>>> no
>>>>>>
>>>>>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>>>>>
>>>>>>> I get errors when I do so trying to figure out if I'm on the wrong track.
>>>>>
>>>>
>>>>
>>
>

Re: SSVD for dimensional reduction + Kmeans

Posted by Pat Ferrel <pa...@gmail.com>.
The built-in PCA option is one reason I wanted to try SSVD. Building the test was to make sure I understood the external matrix operations before diving in. I expect one primary decision is how to choose k for the reduction. I'm hoping to get some noise rejection, so I'm not using it so much to reduce matrix size. We are starting with m = 500,000 and a million or so docs. We get many dups and low-value docs in a small web crawl, so lots of noise.

You mention in your paper:

"The value of k + p directly impacts running time and memory requirements.
k+p=500 is probably more than reasonable. Typically k + p
is taken within range 20…200"

So I guess we might start with:
	-p 15 (default)
	-q 1
	-k 200

Is there any use in hand-inspecting the eigenvectors before choosing the final k? If so, do you get those by choosing k nearly = m, or is something like k = 1000 (or ?) good enough for inspection?
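
For reference, here is a minimal sketch of what that starting point might look like against SSVDSolver directly. The constructor argument order and setter names reflect the 0.7-era API as I recall it, and inputPath, tmpPath, and reduceTasks are placeholders; treat all of these as assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

    Configuration conf = new Configuration();
    int k = 200;             // desired rank, as proposed above
    int p = 15;              // oversampling, the CLI default
    int ablockRows = 30000;  // rows per vertical block; assumed, tunable
    SSVDSolver solver =
        new SSVDSolver(conf, new Path[] {inputPath}, tmpPath, ablockRows, k, p, reduceTasks);
    solver.setQ(1);          // one power iteration, -q 1
    solver.run();            // outputs land under tmpPath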


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
BTW if you really are trying to reduce dimensionality, you may want to
consider the --pca option with SSVD, which [I think] will give much
better preserved data variance than plain SVD (i.e., essentially
run a PCA space transformation on your data rather than just an SVD).

-d
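
Continuing the earlier sketch, the programmatic side of this might look as follows, assuming a setPcaMeanPath-style setter backs the --pca CLI option (the method name is an assumption; check the SSVDSolver source for your version):

    // PCA here means SVD of the column-mean-centered input; SSVD applies the
    // centering implicitly, so a sparse input matrix is not densified.
    Path pcaMeanPath = new Path(tmpPath, "mean");  // column means, computed beforehand
    solver.setPcaMeanPath(pcaMeanPath);            // assumed setter behind --pca
    solver.run();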

Re: SSVD for dimensional reduction + Kmeans

Posted by Pat Ferrel <pa...@gmail.com>.
Got it. Well, on to some real and much larger data sets then…


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I think Mahout's Lanczos actually requires external knowledge of the input
size too, in part for similar reasons. SSVD doesn't, because it has no
"other" reasons to know the input size. The fundamental assumption
rank(input) >= rank(thin SVD) still stands for the input, but the
method doesn't try to verify it explicitly (which would be
kind of hard); instead it either produces 0 eigenvectors or runs into
block deficiency.

It is, however, hard to tell whether block deficiency stemmed from
input size deficiency vs. split size deficiency, and neither
situation is typical for real-life SSVD applications, hence the error
message is somewhat vague.


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
The easy answer is to ensure (k+p) <= m. It is a mathematical constraint,
not a peculiarity of the method.

The only reason the solver doesn't warn you explicitly is that the
DistributedRowMatrix format, which is just a sequence file of rows,
doesn't give us an easy way to verify what m actually is
before it actually iterates over the input and runs into block size
deficiency. So if you know m as external knowledge, it is easy to
avoid being trapped by block height deficiency.
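
A minimal sketch of that external check, assuming the row count m is tracked out-of-band (e.g., while vectorizing the documents; numDocs is a hypothetical name):

    int k = 200;
    int p = 15;
    long m = numDocs;  // row count known from the vectorization step
    if (k + p > m) {
      throw new IllegalArgumentException(
          "k + p = " + (k + p) + " must not exceed the row count m = " + m);
    }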


Re: SSVD for dimensional reduction + Kmeans

Posted by Pat Ferrel <pa...@gmail.com>.
This is only a test with some trivially simple data. I doubt there are any splits, and yes, it could easily be done in memory, but that is not the purpose. It is based on testKmeansDSVD2, which is in
mahout/integration/src/test/java/org/apache/mahout/clustering/TestClusterDumper.java
I've attached the modified and running version with testKmeansDSSVD.

As I said, I don't think this is a real-world test. It tests that the code runs, and it does. Getting the best results is not part of the scope. I just thought that if there was an easy answer I could clean up the parameters for SSVDSolver.

Since it is working, I don't know that it's worth the effort unless people are likely to run into this with larger data sets.

Thanks anyway.


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
It happens because of internal constraints stemming from blocking. It
happens when a split of A (the input) has fewer than (k+p) rows, at which
point blocks are too small (or rather, too short) to successfully
perform a QR on.

This also means, among other things, that k+p cannot be more than your
total number of rows in the input.

It is also possible that the input A is way too wide or k+p is way too big,
so that an arbitrary split does not fetch at least k+p rows of A, but
I haven't seen such cases in practice yet. If that
happens, there's an option to increase minSplitSize (which would
undermine MR mapper efficiency somewhat). But I am pretty sure it is
not your case.

But if your input is shorter than k+p, then it is a case too small for
SSVD. In fact, it probably means you can solve the test directly in memory
with any solver. You can still use SSVD with k=m and p=0 (I think) in
this case and get an exact (non-reduced-rank) decomposition
with no stochastic effects, but that is not really what it is for.

Assuming your input is m x n, can you please tell me what your m, n, k
and p are?

thanks.
-D
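
To make the corner case concrete with made-up numbers: with the CLI default p = 15 and a toy test set of, say, m = 20 rows, any k > 5 already violates k + p <= m, which is exactly the regime where GivensThinSolver.adjust throws the exception quoted below.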


Re: SSVD for dimensional reduction + Kmeans

Posted by Pat Ferrel <pa...@gmail.com>.
There seems to be some internal constraint on k and/or p, which is making a test difficult. The test has a very small input doc set, and with the wrong k it is very easy to get a failure with this message:

	java.lang.IllegalArgumentException: new m can't be less than n
		at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)

I have a working test, but I had to add some docs to the test data and have tried to reverse-engineer the value for k (desiredRank). I came up with the following, but I think it is only an accident that it works.

        int p = 15; // default value for the CLI
        int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1; not sure why this works
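
Given the (k+p) <= m constraint explained elsewhere in the thread, the formula above is less accidental than it looks: it simply pins k + p to m - 1. A sketch of the same thing stated directly:

    int m = sampleData.size();    // rows: number of docs in the test set
    int desiredRank = m - p - 1;  // i.e. k + p = m - 1, safely below m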

This seems likely to be an issue only because of the very small data set and the relationship between rows, columns, p, and k. But for the purposes of creating a test, if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set, it would help.

This test is derived from a non-active SVD test, but I'd be up for cleaning it up and including it as an example among the working but non-active tests. I also fixed a couple of trivial bugs in the non-active Lanczos tests, for what it's worth.


Re: SSVD for dimensional reduction + Kmeans

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Thanks, I've read the docs and appreciate how thorough they are! 

BTW skipping the cleaning step got the test running.


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Reading "overview and usage" doc linked on that page
https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
should help to clarify outputs and usage.


Re: SSVD for dimensional reduction + Kmeans

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel <pa...@gmail.com> wrote:
> Quoth Grant Ingersoll:
>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>
>> bin/mahout svd (original -> svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut -> svdT
>> bin/mahout transpose original -> originalT
>> bin/mahout matrixmult originalT svdT -> newMatrix
>> bin/mahout kmeans newMatrix
>
> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?

No

> if so where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V so should I transpose V* then run it through the EigenVerificationJob?
no

SSVD is SVD, meaning it produces U and V with no further need to clean them.

> I get errors when I do so trying to figure out if I'm on the wrong track.
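
Since SSVD emits U and V directly, the svd/cleansvd/transpose/matrixmult chain quoted above collapses to a single decomposition plus k-means. A sketch under the same 0.7-era API assumptions as the earlier snippets (conf, inputA, tmpPath, ablockRows, k, p, and reduceTasks are placeholders):

    // U is m x k: each input row (doc) is already expressed in the reduced
    // basis, so it can be clustered directly; no cleaning or projection jobs.
    SSVDSolver solver =
        new SSVDSolver(conf, new Path[] {inputA}, tmpPath, ablockRows, k, p, reduceTasks);
    solver.setComputeU(true);
    solver.setComputeV(false);  // V is not needed when clustering rows
    solver.run();
    Path reducedDocs = new Path(solver.getUPath());  // hand this to kmeans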