Posted to user@mahout.apache.org by Pat Ferrel <pa...@gmail.com> on 2012/08/18 17:28:07 UTC

SSVD + PCA

Switching from API to CLI 

The parameter -t is described in the PDF as follows:

--reduceTasks <int-value> optional. The number of reducers to use (where applicable): depends on the size of the hadoop cluster. At this point it could also be overwritten by a standard hadoop property using -D option
4. Probably always needs to be specified as by default Hadoop would set it to 1, which is certainly far below the cluster capacity. Recommended value for this option ~ 95% or ~190% of available reducer capacity to allow for opportunistic executions.

The description above seems to say it will be taken from the Hadoop config if not specified, which is probably all most people would ever want. I am unclear why this is needed. I cannot run SSVD without specifying it; in other words, it does not seem to be optional.
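
For reference, a minimal sketch of what that option boils down to on the Hadoop side (the property name mapred.reduce.tasks is the classic MR1 one, and the Job shown is a generic stand-in, not the SSVD driver itself):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Unless the client config or a -D option supplies mapred.reduce.tasks,
        // classic MapReduce falls back to a single reduce task.
        int reducers = conf.getInt("mapred.reduce.tasks", 1);
        System.out.println("reducers the client config would give us: " + reducers);

        // What a --reduceTasks/-t style option typically does internally:
        Job job = new Job(conf, "ssvd-like job");
        job.setNumReduceTasks(reducers);
      }
    }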

As a first try using the CLI I'm running with 295625 rows and 337258 columns, using the following parameters to get a sort of worst-case runtime with best-case output quality. The parameters will be tweaked later to get better dimensionality reduction and runtime.

    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)

Is there work being done to calculate the variance retained for the output or should I calculate it myself?
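
For anyone computing it by hand: retained variance is just the ratio of the squared singular values kept to the total variance of the mean-centered data. A minimal sketch, in plain Java rather than any Mahout API, assuming the singular values and the total variance are obtained elsewhere (e.g. the SSVD sigma output plus a separate pass over the data):

    public class RetainedVariance {
      /**
       * Fraction of variance retained by the top k principal components.
       * sigma          singular values of the mean-centered matrix, descending
       * numRows        number of rows (observations) in the matrix
       * totalVariance  trace of the covariance matrix (sum of per-column
       *                variances), computed in a separate pass over the data --
       *                the top-k singular values alone do not determine it
       */
      static double retained(double[] sigma, int k, long numRows, double totalVariance) {
        double kept = 0.0;
        for (int i = 0; i < k && i < sigma.length; i++) {
          kept += sigma[i] * sigma[i];
        }
        // For centered A, the i-th eigenvalue of the covariance is sigma_i^2 / (n - 1).
        return kept / (numRows - 1) / totalVariance;
      }

      public static void main(String[] args) {
        double[] sigma = {120.0, 60.0, 30.0, 15.0};                 // toy values
        double total = (120*120.0 + 60*60 + 30*30 + 15*15) / 999;   // toy total, n = 1000
        System.out.println(retained(sigma, 2, 1000, total));        // ~0.94
      }
    }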

Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Also, I'd suggest starting with a much smaller k. It directly affects task
running time, especially with q>0. I think in Nathan's dissertation he ran
the Wikipedia set with only 100 singular values, which already provides
most of the spectrum if you look at the decay chart. In practice I don't
know anyone who has run it with more than 200. It just reaches the point of
diminishing returns around 100, where you pay quite a bit for too little
additional information.
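
A small sketch of reading that decay off the singular values directly, assuming they have been exported to an array (the values below are placeholders); the cumulative fraction of sigma^2 makes the point of diminishing returns easy to spot:

    public class SpectrumDecay {
      public static void main(String[] args) {
        // Placeholder values; in practice, export the singular values from the
        // SSVD sigma output and load them here.
        double[] sigma = {100, 55, 32, 20, 13, 9, 6, 4, 3, 2};

        double total = 0.0;
        for (double s : sigma) {
          total += s * s;
        }
        double running = 0.0;
        for (int k = 1; k <= sigma.length; k++) {
          running += sigma[k - 1] * sigma[k - 1];
          System.out.printf("k=%-3d cumulative energy=%.4f%n", k, running / total);
        }
      }
    }
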
On Aug 18, 2012 10:39 AM, "Dmitriy Lyubimov" <dl...@gmail.com> wrote:

>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> >
> > Switching from API to CLI
> >
> > the parameter -t is described in the PDF
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
> > 4. Probably always needs to be specified as by default Hadoop would set
> it to 1, which is certainly far below the cluster capacity. Recommended
> value for this option ~ 95% or ~190% of available reducer capacity to allow
> for opportunistic executions.
> >
> > The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all most people would every
> want. I am unclear why this is needed? I cannot run SSVD without specifying
> it, in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly
> forgetting set the number of reducers and kept coming back with questions
> like why it is running so slow. So there was an issue in 0.7 where i made
> it mandatory. I am actually not sure now other mahout methods ensure
> reducer specification is always specified other than 1
>
> >
> > As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters to get a sort of worst case run time
> result with best case data output. The parameters will be tweaked later to
> get better dimensional reduction and runtime.
> >
> >     mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> on cluster)
> >
> > Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?
>
> No theres no work done since it implies your are building your own
> pipeline for a particular purpose. It also takes a lot of assumptions that
> may or may not hold in a  particular case, such that you do something
> repeatedly and corpuses are of similar nature. Also, i know no paper that
> would do it exactly the way i described, so theres no error estimate on
> either inequality approach or any sort of decay interpolation.
>
> It is not very difficult to experiment a little with your data though with
> a subset of the corpus and see what may work.
>

Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I'll take a look, although it used to run on the output of seq2sparse; I
made sure of that some time ago. This never happened before. Perhaps
something got broken...
On Aug 19, 2012 1:06 AM, "Pat Ferrel" <pa...@gmail.com> wrote:

> -t Param
>
> I'm no hadoop expert but there are a couple parameters for each node in a
> cluster that specifies the default number of mappers and reducers for that
> node. There is a rule of thumb about how many mappers and reducers per
> core. You can tweak them either way depending on your typical jobs.
>
> No idea what you mean about the total reducers being 1 for most configs.
> My very small cluster at home with 10 cores in three machines is configured
> to produce a conservative 10 mappers and 10 reducers, which is about what
> happens with balanced jobs. The reducers = 1 is probably for a
> non-clustered one machine setup.
>
> I'm suspicious that the -t  parameter is not needed but would definitely
> defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than
> another, it would be worth noticing, even if we don't know the actual error
> in it. To say that your estimate of VR is valueless would require that we
> have some experience with it, no?
>
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> >
> > Switching from API to CLI
> >
> > the parameter -t is described in the PDF
> >
> > --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
> > 4. Probably always needs to be specified as by default Hadoop would set it
> to 1, which is certainly far below the cluster capacity. Recommended value
> for this option ~ 95% or ~190% of available reducer capacity to allow for
> opportunistic executions.
> >
> > The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all most people would every
> want. I am unclear why this is needed? I cannot run SSVD without specifying
> it, in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting
> set the number of reducers and kept coming back with questions like why it
> is running so slow. So there was an issue in 0.7 where i made it mandatory.
> I am actually not sure now other mahout methods ensure reducer
> specification is always specified other than 1
>
> >
> > As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters to get a sort of worst case run time
> result with best case data output. The parameters will be tweaked later to
> get better dimensional reduction and runtime.
> >
> >    mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> on cluster)
> >
> > Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?
>
> No theres no work done since it implies your are building your own pipeline
> for a particular purpose. It also takes a lot of assumptions that may or
> may not hold in a  particular case, such that you do something repeatedly
> and corpuses are of similar nature. Also, i know no paper that would do it
> exactly the way i described, so theres no error estimate on either
> inequality approach or any sort of decay interpolation.
>
> It is not very difficult to experiment a little with your data though with
> a subset of the corpus and see what may work.
>
>

Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: remember that all matrix input must be in DistributedRowMatrix format.
You can try to validate it by running other matrix algorithms on it
(key-type tolerance differs somewhat among the distributed matrix
algorithms, but the value in the input always has to be VectorWritable,
which seems to be violated in your case for some reason that is not yet
clear).
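
A minimal sketch of that check, using plain Hadoop sequence-file I/O rather than any Mahout utility (it walks a single directory passed as args[0]; recurse into subdirectories as needed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class CheckSeqFileTypes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is the ssvd input path; recurse into subdirectories as needed.
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
          String name = stat.getPath().getName();
          if (stat.isDir() || name.startsWith("_") || name.startsWith(".")) {
            continue;                      // skip _logs, _SUCCESS, hidden files
          }
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
          try {
            // SSVD expects the value class to be VectorWritable everywhere;
            // a Text value anywhere in the input is what produces the cast error.
            System.out.println(stat.getPath() + "  key=" + reader.getKeyClassName()
                + "  value=" + reader.getValueClassName());
          } finally {
            reader.close();
          }
        }
      }
    }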

On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Ok this just means that something in the A input is not really
> adhering to <Writable,VectorWritable> specification. In particular,
> there seems to be a file in the input path that has <?,VectorWritable>
> pair in its input.
>
> Can you check your input files for key/value types? Note that includes
> entire subtree of sequence files, not just files in the input
> directory.
>
> Usually it is visible in the header of the sequence file (usually even
> if it is using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set, 8 are run on the mini cluster in parallel. Most of them complete with no errors but there usually two map task failures for each QJob, they die with the error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>         at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:416)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> The data was created using seq2sparse and then running rowid to create the input matrix. The data was encoded as named vectors. These are the two differences I could think of between how I ran it from the API and from the CLI.
>>
>>
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>> -t Param
>>
>> I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>>
>> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered one machine setup.
>>
>> I'm suspicious that the -t  parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>>
>> Variance Retained
>>
>> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>>
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>
>>> Switching from API to CLI
>>>
>>> the parameter -t is described in the PDF
>>>
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>> applicable): depends on the size of the hadoop cluster. At this point it
>> could also be overwritten by a standard hadoop property using -D option
>>> 4. Probably always needs to be specified as by default Hadoop would set it
>> to 1, which is certainly far below the cluster capacity. Recommended value
>> for this option ~ 95% or ~190% of available reducer capacity to allow for
>> opportunistic executions.
>>>
>>> The description above seems to say it will be taken from the hadoop
>> config if not specified, which is probably all most people would every
>> want. I am unclear why this is needed? I cannot run SSVD without specifying
>> it, in other words it does not seem to be optional?
>>
>> This parameter was made mandatory because people were repeatedly forgetting
>> set the number of reducers and kept coming back with questions like why it
>> is running so slow. So there was an issue in 0.7 where i made it mandatory.
>> I am actually not sure now other mahout methods ensure reducer
>> specification is always specified other than 1
>>
>>>
>>> As a first try using the CLI I'm running with 295625 rows and 337258
>> columns using the following parameters to get a sort of worst case run time
>> result with best case data output. The parameters will be tweaked later to
>> get better dimensional reduction and runtime.
>>>
>>>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
>> on cluster)
>>>
>>> Is there work being done to calculate the variance retained for the
>> output or should I calculate it myself?
>>
>> No theres no work done since it implies your are building your own pipeline
>> for a particular purpose. It also takes a lot of assumptions that may or
>> may not hold in a  particular case, such that you do something repeatedly
>> and corpuses are of similar nature. Also, i know no paper that would do it
>> exactly the way i described, so theres no error estimate on either
>> inequality approach or any sort of decay interpolation.
>>
>> It is not very difficult to experiment a little with your data though with
>> a subset of the corpus and see what may work.
>>
>>

Re: SSVD + PCA

Posted by Pat Ferrel <pa...@gmail.com>.
OMG, sorry, but I am a complete idiot. RowId creates a "docIndex" in the matrix dir; once I specified the full path to the DistributedRowMatrix file itself, everything was fine.

On Aug 20, 2012, at 11:09 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Ok this just means that something in the A input is not really
> adhering to <Writable,VectorWritable> specification. In particular,
> there seems to be a file in the input path that has <?,VectorWritable>
> pair in its input.

sorry this should read

> there seems to be a file in the input path that has <?,Text>
> pair in its input.

Input seems to have Text values somewhere.

> 
> Can you check your input files for key/value types? Note that includes
> entire subtree of sequence files, not just files in the input
> directory.
> 
> Usually it is visible in the header of the sequence file (usually even
> if it is using compression).
> 
> I am not quite sure what you mean by "rowid" processing.
> 
> 
> 
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Getting an odd error on SSVD.
>> 
>> Starting with the QJob I get 9 map tasks for the data set, 8 are run on the mini cluster in parallel. Most of them complete with no errors but there usually two map task failures for each QJob, they die with the error:
>> 
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>        at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:416)
>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> 
>> The data was created using seq2sparse and then running rowid to create the input matrix. The data was encoded as named vectors. These are the two differences I could think of between how I ran it from the API and from the CLI.
>> 
>> 
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> 
>> -t Param
>> 
>> I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>> 
>> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered one machine setup.
>> 
>> I'm suspicious that the -t  parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>> 
>> Variance Retained
>> 
>> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>> 
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> 
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>> 
>>> Switching from API to CLI
>>> 
>>> the parameter -t is described in the PDF
>>> 
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>> applicable): depends on the size of the hadoop cluster. At this point it
>> could also be overwritten by a standard hadoop property using -D option
>>> 4. Probably always needs to be specified as by default Hadoop would set it
>> to 1, which is certainly far below the cluster capacity. Recommended value
>> for this option ~ 95% or ~190% of available reducer capacity to allow for
>> opportunistic executions.
>>> 
>>> The description above seems to say it will be taken from the hadoop
>> config if not specified, which is probably all most people would every
>> want. I am unclear why this is needed? I cannot run SSVD without specifying
>> it, in other words it does not seem to be optional?
>> 
>> This parameter was made mandatory because people were repeatedly forgetting
>> set the number of reducers and kept coming back with questions like why it
>> is running so slow. So there was an issue in 0.7 where i made it mandatory.
>> I am actually not sure now other mahout methods ensure reducer
>> specification is always specified other than 1
>> 
>>> 
>>> As a first try using the CLI I'm running with 295625 rows and 337258
>> columns using the following parameters to get a sort of worst case run time
>> result with best case data output. The parameters will be tweaked later to
>> get better dimensional reduction and runtime.
>>> 
>>>  mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
>> on cluster)
>>> 
>>> Is there work being done to calculate the variance retained for the
>> output or should I calculate it myself?
>> 
>> No theres no work done since it implies your are building your own pipeline
>> for a particular purpose. It also takes a lot of assumptions that may or
>> may not hold in a  particular case, such that you do something repeatedly
>> and corpuses are of similar nature. Also, i know no paper that would do it
>> exactly the way i described, so theres no error estimate on either
>> inequality approach or any sort of decay interpolation.
>> 
>> It is not very difficult to experiment a little with your data though with
>> a subset of the corpus and see what may work.
>> 
>> 


Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Ok this just means that something in the A input is not really
> adhering to <Writable,VectorWritable> specification. In particular,
> there seems to be a file in the input path that has <?,VectorWritable>
> pair in its input.

sorry this should read

> there seems to be a file in the input path that has <?,Text>
> pair in its input.

Input seems to have Text values somewhere.

>
> Can you check your input files for key/value types? Note that includes
> entire subtree of sequence files, not just files in the input
> directory.
>
> Usually it is visible in the header of the sequence file (usually even
> if it is using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <pa...@gmail.com> wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set, 8 are run on the mini cluster in parallel. Most of them complete with no errors but there usually two map task failures for each QJob, they die with the error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>         at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:416)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> The data was created using seq2sparse and then running rowid to create the input matrix. The data was encoded as named vectors. These are the two differences I could think of between how I ran it from the API and from the CLI.
>>
>>
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>> -t Param
>>
>> I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>>
>> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered one machine setup.
>>
>> I'm suspicious that the -t  parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>>
>> Variance Retained
>>
>> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>>
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>>
>>> Switching from API to CLI
>>>
>>> the parameter -t is described in the PDF
>>>
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>> applicable): depends on the size of the hadoop cluster. At this point it
>> could also be overwritten by a standard hadoop property using -D option
>>> 4. Probably always needs to be specified as by default Hadoop would set it
>> to 1, which is certainly far below the cluster capacity. Recommended value
>> for this option ~ 95% or ~190% of available reducer capacity to allow for
>> opportunistic executions.
>>>
>>> The description above seems to say it will be taken from the hadoop
>> config if not specified, which is probably all most people would every
>> want. I am unclear why this is needed? I cannot run SSVD without specifying
>> it, in other words it does not seem to be optional?
>>
>> This parameter was made mandatory because people were repeatedly forgetting
>> set the number of reducers and kept coming back with questions like why it
>> is running so slow. So there was an issue in 0.7 where i made it mandatory.
>> I am actually not sure now other mahout methods ensure reducer
>> specification is always specified other than 1
>>
>>>
>>> As a first try using the CLI I'm running with 295625 rows and 337258
>> columns using the following parameters to get a sort of worst case run time
>> result with best case data output. The parameters will be tweaked later to
>> get better dimensional reduction and runtime.
>>>
>>>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
>> on cluster)
>>>
>>> Is there work being done to calculate the variance retained for the
>> output or should I calculate it myself?
>>
>> No theres no work done since it implies your are building your own pipeline
>> for a particular purpose. It also takes a lot of assumptions that may or
>> may not hold in a  particular case, such that you do something repeatedly
>> and corpuses are of similar nature. Also, i know no paper that would do it
>> exactly the way i described, so theres no error estimate on either
>> inequality approach or any sort of decay interpolation.
>>
>> It is not very difficult to experiment a little with your data though with
>> a subset of the corpus and see what may work.
>>
>>

Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok this just means that something in the A input is not really
adhering to <Writable,VectorWritable> specification. In particular,
there seems to be a file in the input path that has <?,VectorWritable>
pair in its input.

Can you check your input files for key/value types? Note that this
includes the entire subtree of sequence files, not just the files in the
input directory.

Usually it is visible in the header of the sequence file (even if it is
using compression).

I am not quite sure what you mean by "rowid" processing.



On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <pa...@gmail.com> wrote:
> Getting an odd error on SSVD.
>
> Starting with the QJob I get 9 map tasks for the data set, 8 are run on the mini cluster in parallel. Most of them complete with no errors but there usually two map task failures for each QJob, they die with the error:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>         at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:416)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> The data was created using seq2sparse and then running rowid to create the input matrix. The data was encoded as named vectors. These are the two differences I could think of between how I ran it from the API and from the CLI.
>
>
> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pa...@gmail.com> wrote:
>
> -t Param
>
> I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
>
> No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered one machine setup.
>
> I'm suspicious that the -t  parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no?
>
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>>
>> Switching from API to CLI
>>
>> the parameter -t is described in the PDF
>>
>> --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
>> 4. Probably always needs to be specified as by default Hadoop would set it
> to 1, which is certainly far below the cluster capacity. Recommended value
> for this option ~ 95% or ~190% of available reducer capacity to allow for
> opportunistic executions.
>>
>> The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all most people would every
> want. I am unclear why this is needed? I cannot run SSVD without specifying
> it, in other words it does not seem to be optional?
>
> This parameter was made mandatory because people were repeatedly forgetting
> set the number of reducers and kept coming back with questions like why it
> is running so slow. So there was an issue in 0.7 where i made it mandatory.
> I am actually not sure now other mahout methods ensure reducer
> specification is always specified other than 1
>
>>
>> As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters to get a sort of worst case run time
> result with best case data output. The parameters will be tweaked later to
> get better dimensional reduction and runtime.
>>
>>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> on cluster)
>>
>> Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?
>
> No theres no work done since it implies your are building your own pipeline
> for a particular purpose. It also takes a lot of assumptions that may or
> may not hold in a  particular case, such that you do something repeatedly
> and corpuses are of similar nature. Also, i know no paper that would do it
> exactly the way i described, so theres no error estimate on either
> inequality approach or any sort of decay interpolation.
>
> It is not very difficult to experiment a little with your data though with
> a subset of the corpus and see what may work.
>
>

Re: SSVD + PCA

Posted by Pat Ferrel <pa...@gmail.com>.
Getting an odd error on SSVD. 

Starting with the QJob, I get 9 map tasks for the data set; 8 run on the mini cluster in parallel. Most of them complete with no errors, but there are usually two map task failures per QJob, and they die with this error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
	at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:416)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

The data was created using seq2sparse, and then rowid was run to create the input matrix. The data was encoded as named vectors. These are the two differences I could think of between how I ran it from the API and how I ran it from the CLI.

 
On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pa...@gmail.com> wrote:

-t Param

I'm no hadoop expert but there are a couple parameters for each node in a cluster that specifies the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs. 

No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered one machine setup.

I'm suspicious that the -t  parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.

Variance Retained

If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no? 

On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> 
> Switching from API to CLI
> 
> the parameter -t is described in the PDF
> 
> --reduceTasks <int-value> optional. The number of reducers to use (where
applicable): depends on the size of the hadoop cluster. At this point it
could also be overwritten by a standard hadoop property using -D option
> 4. Probably always needs to be specified as by default Hadoop would set it
to 1, which is certainly far below the cluster capacity. Recommended value
for this option ~ 95% or ~190% of available reducer capacity to allow for
opportunistic executions.
> 
> The description above seems to say it will be taken from the hadoop
config if not specified, which is probably all most people would every
want. I am unclear why this is needed? I cannot run SSVD without specifying
it, in other words it does not seem to be optional?

This parameter was made mandatory because people were repeatedly forgetting
set the number of reducers and kept coming back with questions like why it
is running so slow. So there was an issue in 0.7 where i made it mandatory.
I am actually not sure now other mahout methods ensure reducer
specification is always specified other than 1

> 
> As a first try using the CLI I'm running with 295625 rows and 337258
columns using the following parameters to get a sort of worst case run time
result with best case data output. The parameters will be tweaked later to
get better dimensional reduction and runtime.
> 
>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
on cluster)
> 
> Is there work being done to calculate the variance retained for the
output or should I calculate it myself?

No theres no work done since it implies your are building your own pipeline
for a particular purpose. It also takes a lot of assumptions that may or
may not hold in a  particular case, such that you do something repeatedly
and corpuses are of similar nature. Also, i know no paper that would do it
exactly the way i described, so theres no error estimate on either
inequality approach or any sort of decay interpolation.

It is not very difficult to experiment a little with your data though with
a subset of the corpus and see what may work.



Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Aug 21, 2012 8:52 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>
> RE:  the -t param. I understand your point but the requirement seems
counter to the philosophy of Hadoop where it's the responsibility of the
hadoop cluster admin to determine the number of jops, tasks, mappers, and
reducers that can run on the cluster or any node. Some tweaking is required
for specific jobs but that is not why you have the -t. As far as I know
there is nothing special about the SSVD job reducers.

Having too many reducers may still cause QR blocking deficiency in the reduce
tasks of the power iterations, and potentially in the mappers of the next
stage. This needs to be controlled explicitly, especially, as I said, on
large clusters.

Requiring -t forces a user to change their scripts (is it required in the
API too? now that would be bad)

No, the API does not require it to be set.

every time the cluster config changes or when running on a different
cluster. And as you say, if people don't understand the use of multiple
reducers they will not understand the -t anyway.
>
> I'd vote to make the param optional especially in the API. I personally
would rather leave it up to the hadoop config to determine.
>
> BTW if anyone else is reading this, the SSVD ran remarkably fast on my
micro cluster (8 cores in two machines) for 295625 docs and 337258 terms
even with worst case parameters. I don't think it took a complete quarter
of football to finish (50 minutes actually), which gave me something to
cheer about  :-P

Yes, it is usually quite fast and accurate on modestly sized problems. One
known bottleneck is the mapper-side matrix multiplication during power
iterations on super-sparse matrices. If your input dimensions are quite
large and the problem is super-sparse, it may become sensitive to the
memory available in the mappers of the power iterations (q>0). Once it
starts hitting the swap partition on disk due to memory pressure, it may
actually run quite slowly in the power-iteration phase (the ABt job).

>
>
> On Aug 20, 2012, at 8:23 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> On Aug 19, 2012 1:06 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> >
> > -t Param
> >
> > I'm no hadoop expert but there are a couple parameters for each node in
a
> cluster that specifies the default number of mappers and reducers for that
> node. There is a rule of thumb about how many mappers and reducers per
> core. You can tweak them either way depending on your typical jobs.
> >
> > No idea what you mean about the total reducers being 1 for most configs.
> My very small cluster at home with 10 cores in three machines is
configured
> to produce a conservative 10 mappers and 10 reducers, which is about what
> happens with balanced jobs. The reducers = 1 is probably for a
> non-clustered one machine setup.
>
> Yes i agree i was thinking the same and relying on people doing the right
> thing initially. And the life proved me wrong. Absolutely all crews who
> tried the method, not only did they not have reducers set up in their
local
> client conf, but also they failed to use t parameter to fix it. Also they
> all failed to diagnose it on their own (i.e. simply noticing it in the job
> stats). I think it has something to do with a typical background of our
> customer.
>
> >
> > I'm suspicious that the -t  parameter is not needed but would definitely
> defer to a hadoop master. In any case I set it to 10 for my mini cluster.
>
> Recommended value is 95% of the cluster capacity to leave space for
> opportunistic execution. Although on bigger clustes, i am far from sure
> that too many reducers may be that beneficial for a particular problem.
> Hence again override of default in command line may be useful.
>
> Also one usually ha more than 1 task capacity per node, so i would expect
> your cluster to be able to run up to 40 reducers, typically
> >
> > Variance Retained
> >
> > If one batch of data yields a greatly different estimate of VR than
> another, it would be worth noticing, even if we don't know the actual
error
> in it. To say that your estimate of VR is valueless would require that we
> have some experience with it, no?
>
> I am not saying it is valueless. Actually i am hoping it is useful, or i
> wouldnt inckude it in the howto. I am just saying it is something i leave
> outside the scope of the method itself.
>
> >
> > On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:
> >
> > On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> >>
> >> Switching from API to CLI
> >>
> >> the parameter -t is described in the PDF
> >>
> >> --reduceTasks <int-value> optional. The number of reducers to use
(where
> > applicable): depends on the size of the hadoop cluster. At this point it
> > could also be overwritten by a standard hadoop property using -D option
> >> 4. Probably always needs to be specified as by default Hadoop would set
> it
> > to 1, which is certainly far below the cluster capacity. Recommended
value
> > for this option ~ 95% or ~190% of available reducer capacity to allow
for
> > opportunistic executions.
> >>
> >> The description above seems to say it will be taken from the hadoop
> > config if not specified, which is probably all most people would every
> > want. I am unclear why this is needed? I cannot run SSVD without
> specifying
> > it, in other words it does not seem to be optional?
> >
> > This parameter was made mandatory because people were repeatedly
> forgetting
> > set the number of reducers and kept coming back with questions like why
it
> > is running so slow. So there was an issue in 0.7 where i made it
> mandatory.
> > I am actually not sure now other mahout methods ensure reducer
> > specification is always specified other than 1
> >
> >>
> >> As a first try using the CLI I'm running with 295625 rows and 337258
> > columns using the following parameters to get a sort of worst case run
> time
> > result with best case data output. The parameters will be tweaked later
to
> > get better dimensional reduction and runtime.
> >>
> >>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> > on cluster)
> >>
> >> Is there work being done to calculate the variance retained for the
> > output or should I calculate it myself?
> >
> > No theres no work done since it implies your are building your own
> pipeline
> > for a particular purpose. It also takes a lot of assumptions that may or
> > may not hold in a  particular case, such that you do something
repeatedly
> > and corpuses are of similar nature. Also, i know no paper that would do
it
> > exactly the way i described, so theres no error estimate on either
> > inequality approach or any sort of decay interpolation.
> >
> > It is not very difficult to experiment a little with your data though
with
> > a subset of the corpus and see what may work.
> >
>

Re: SSVD + PCA

Posted by Pat Ferrel <pa...@gmail.com>.
RE: the -t param. I understand your point, but the requirement seems counter to the philosophy of Hadoop, where it's the responsibility of the Hadoop cluster admin to determine the number of jobs, tasks, mappers, and reducers that can run on the cluster or any node. Some tweaking is required for specific jobs, but that is not why -t exists; as far as I know there is nothing special about the SSVD job's reducers. Requiring -t forces a user to change their scripts (is it required in the API too? now that would be bad) every time the cluster config changes or when running on a different cluster. And as you say, if people don't understand the use of multiple reducers they will not understand -t anyway.

I'd vote to make the param optional especially in the API. I personally would rather leave it up to the hadoop config to determine.

BTW, if anyone else is reading this: SSVD ran remarkably fast on my micro cluster (8 cores in two machines) for 295625 docs and 337258 terms, even with worst-case parameters. I don't think it took a complete quarter of football to finish (50 minutes, actually), which gave me something to cheer about  :-P

 
On Aug 20, 2012, at 8:23 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

On Aug 19, 2012 1:06 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
> 
> -t Param
> 
> I'm no hadoop expert but there are a couple parameters for each node in a
cluster that specifies the default number of mappers and reducers for that
node. There is a rule of thumb about how many mappers and reducers per
core. You can tweak them either way depending on your typical jobs.
> 
> No idea what you mean about the total reducers being 1 for most configs.
My very small cluster at home with 10 cores in three machines is configured
to produce a conservative 10 mappers and 10 reducers, which is about what
happens with balanced jobs. The reducers = 1 is probably for a
non-clustered one machine setup.

Yes i agree i was thinking the same and relying on people doing the right
thing initially. And the life proved me wrong. Absolutely all crews who
tried the method, not only did they not have reducers set up in their local
client conf, but also they failed to use t parameter to fix it. Also they
all failed to diagnose it on their own (i.e. simply noticing it in the job
stats). I think it has something to do with a typical background of our
customer.

> 
> I'm suspicious that the -t  parameter is not needed but would definitely
defer to a hadoop master. In any case I set it to 10 for my mini cluster.

Recommended value is 95% of the cluster capacity to leave space for
opportunistic execution. Although on bigger clustes, i am far from sure
that too many reducers may be that beneficial for a particular problem.
Hence again override of default in command line may be useful.

Also one usually ha more than 1 task capacity per node, so i would expect
your cluster to be able to run up to 40 reducers, typically
> 
> Variance Retained
> 
> If one batch of data yields a greatly different estimate of VR than
another, it would be worth noticing, even if we don't know the actual error
in it. To say that your estimate of VR is valueless would require that we
have some experience with it, no?

I am not saying it is valueless. Actually i am hoping it is useful, or i
wouldnt inckude it in the howto. I am just saying it is something i leave
outside the scope of the method itself.

> 
> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>> 
>> Switching from API to CLI
>> 
>> the parameter -t is described in the PDF
>> 
>> --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
>> 4. Probably always needs to be specified as by default Hadoop would set
it
> to 1, which is certainly far below the cluster capacity. Recommended value
> for this option ~ 95% or ~190% of available reducer capacity to allow for
> opportunistic executions.
>> 
>> The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all most people would every
> want. I am unclear why this is needed? I cannot run SSVD without
specifying
> it, in other words it does not seem to be optional?
> 
> This parameter was made mandatory because people were repeatedly
forgetting
> set the number of reducers and kept coming back with questions like why it
> is running so slow. So there was an issue in 0.7 where i made it
mandatory.
> I am actually not sure now other mahout methods ensure reducer
> specification is always specified other than 1
> 
>> 
>> As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters to get a sort of worst case run
time
> result with best case data output. The parameters will be tweaked later to
> get better dimensional reduction and runtime.
>> 
>>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends
> on cluster)
>> 
>> Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?
> 
> No theres no work done since it implies your are building your own
pipeline
> for a particular purpose. It also takes a lot of assumptions that may or
> may not hold in a  particular case, such that you do something repeatedly
> and corpuses are of similar nature. Also, i know no paper that would do it
> exactly the way i described, so theres no error estimate on either
> inequality approach or any sort of decay interpolation.
> 
> It is not very difficult to experiment a little with your data though with
> a subset of the corpus and see what may work.
> 


Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Aug 19, 2012 1:06 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>
> -t Param
>
> I'm no hadoop expert but there are a couple parameters for each node in a
cluster that specifies the default number of mappers and reducers for that
node. There is a rule of thumb about how many mappers and reducers per
core. You can tweak them either way depending on your typical jobs.
>
> No idea what you mean about the total reducers being 1 for most configs.
My very small cluster at home with 10 cores in three machines is configured
to produce a conservative 10 mappers and 10 reducers, which is about what
happens with balanced jobs. The reducers = 1 is probably for a
non-clustered one machine setup.

Yes, I agree; I was thinking the same and initially relied on people doing
the right thing. And life proved me wrong. All of the crews who tried the
method not only did not have reducers set up in their local client conf,
they also failed to use the -t parameter to fix it. They also all failed to
diagnose it on their own (i.e., by simply noticing it in the job stats). I
think it has something to do with the typical background of our customers.

>
> I'm suspicious that the -t parameter is not needed but would definitely
> defer to a hadoop master. In any case I set it to 10 for my mini cluster.

The recommended value is ~95% of the cluster's reducer capacity, to leave
room for opportunistic execution. Although on bigger clusters I am far from
sure that a very large number of reducers is all that beneficial for a
particular problem, so again, overriding the default on the command line
may be useful.

Also, one usually has more than one task slot per node, so I would expect
your cluster to be able to run up to 40 reducers, typically.
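
As a rough illustration (using the 10 reducer slots mentioned above, not
anything known about the actual configuration): ~95% of capacity works out
to -t 9 or -t 10, while the ~190% figure for opportunistic executions would
be around -t 19.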
>
> Variance Retained
>
> If one batch of data yields a greatly different estimate of VR than
> another, it would be worth noticing, even if we don't know the actual
> error in it. To say that your estimate of VR is valueless would require
> that we have some experience with it, no?

I am not saying it is valueless. Actually, I am hoping it is useful, or I
wouldn't include it in the how-to. I am just saying it is something I leave
outside the scope of the method itself.
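
If you do end up computing it yourself, the usual estimate is the sum of
the top-k squared singular values divided by the squared Frobenius norm of
the mean-centered input (i.e. the total variance up to a constant factor),
and SSVD only gives you the former, so the total has to be accumulated
separately from the data. A minimal sketch with made-up numbers, just to
show the arithmetic:

    public class VarianceRetained {
      public static void main(String[] args) {
        // Made-up singular values, as an SSVD run might report for k = 5.
        double[] sigma = {120.4, 87.1, 55.9, 31.2, 12.7};
        // Squared Frobenius norm of the mean-centered input, accumulated
        // separately while building the input matrix (also made up here).
        double totalVariance = 40000.0;

        double retained = 0.0;
        for (double s : sigma) {
          retained += s * s;
        }
        System.out.printf("variance retained ~ %.1f%%%n",
            100.0 * retained / totalVariance);
      }
    }

Whether the resulting number is comparable across corpora is exactly the
assumption discussed above.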


Re: SSVD + PCA

Posted by Pat Ferrel <pa...@gmail.com>.
-t Param

I'm no hadoop expert but there are a couple of parameters for each node in a cluster that specify the default number of mappers and reducers for that node. There is a rule of thumb about how many mappers and reducers per core. You can tweak them either way depending on your typical jobs.
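
(For reference, in an MRv1 setup these per-node knobs are mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml, while the client-side default that -t overrides is mapred.reduce.tasks.)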

No idea what you mean about the total reducers being 1 for most configs. My very small cluster at home with 10 cores in three machines is configured to produce a conservative 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers = 1 is probably for a non-clustered, one-machine setup.

I'm suspicious that the -t parameter is not needed but would definitely defer to a hadoop master. In any case I set it to 10 for my mini cluster.

Variance Retained

If one batch of data yields a greatly different estimate of VR than another, it would be worth noticing, even if we don't know the actual error in it. To say that your estimate of VR is valueless would require that we have some experience with it, no? 



Re: SSVD + PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pa...@gmail.com> wrote:
>
> Switching from API to CLI
>
> the parameter -t is described in the PDF
>
> --reduceTasks <int-value> optional. The number of reducers to use (where
> applicable): depends on the size of the hadoop cluster. At this point it
> could also be overwritten by a standard hadoop property using -D option
> 4. Probably always needs to be specified as by default Hadoop would set
> it to 1, which is certainly far below the cluster capacity. Recommended
> value for this option ~ 95% or ~190% of available reducer capacity to
> allow for opportunistic executions.
>
> The description above seems to say it will be taken from the hadoop
> config if not specified, which is probably all that most people would
> ever want. I am unclear on why this is needed. I cannot run SSVD without
> specifying it; in other words, it does not seem to be optional?

This parameter was made mandatory because people were repeatedly forgetting
to set the number of reducers and kept coming back with questions like why
it is running so slow. So there was an issue in 0.7 where I made it
mandatory. I am actually not sure whether other Mahout methods ensure the
number of reducers is specified rather than left at 1.
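
For what it's worth, the 1-reducer default really is baked into the
client-side Hadoop configuration rather than anything cluster-specific; a
minimal illustration against the plain Hadoop 1.x mapred API (not Mahout
code):

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerDefaultDemo {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // With no override anywhere this prints 1, i.e. the value a
        // reduce phase would run with if nothing were specified.
        System.out.println("default reducers: " + conf.getNumReduceTasks());
        // Roughly what -t <n> (or -D mapred.reduce.tasks=<n>) arranges:
        conf.setNumReduceTasks(10);
        System.out.println("after override:   " + conf.getNumReduceTasks());
      }
    }

The -D form mentioned in the PDF works as well; -t was made mandatory so
that forgetting both would not silently fall back to a single reducer.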

>
> As a first try using the CLI I'm running with 295625 rows and 337258
> columns using the following parameters to get a sort of worst-case run
> time result with best-case data output. The parameters will be tweaked
> later to get better dimensional reduction and runtime.
>
>     mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>
> Is there work being done to calculate the variance retained for the
> output or should I calculate it myself?

No, there is no work being done on that, since it implies you are building
your own pipeline for a particular purpose. It also makes a lot of
assumptions that may or may not hold in a particular case, such as that you
do something repeatedly and the corpuses are of a similar nature. Also, I
know of no paper that would do it exactly the way I described, so there's
no error estimate on either the inequality approach or any sort of decay
interpolation.

It is not very difficult to experiment a little with your data, though,
with a subset of the corpus, and see what may work.