Posted to user@mahout.apache.org by lei tang <fi...@gmail.com> on 2012/09/14 22:24:50 UTC

SSVD too slow to handle large matrix?

Hi,

I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse
matrix (around 30M x 1M). I used a modified version of the script at
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
to store the input matrix as <key, value> pairs, with an integer key and
a VectorWritable value (in particular, a SequentialAccessSparseVector).
Should I change to RandomAccessSparseVector?

I managed to run Mahout SSVD with the following command:
mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp

I specified the max split size in order to get more mappers. However, the
first Q job does not seem to be making progress. After 1 hour, it is still
at 12% with 100 mappers. Is this expected? Should I change any parameter?

Any suggestion is highly appreciated.

- Lei
P.S.  I'm also reading the docs from
https://issues.apache.org/jira/browse/MAHOUT-376  in hope that I can figure
out why it is so slow.

Re: SSVD too slow to handle large matrix?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Also, you can compare your performance experiments to Nathan Halko's
here: http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
pp. 110+...

They attempted very large problems, as many as 726 splits of 512 MB
each with -k 100. (The default split size is what... 64 MB?) They had
trouble tuning the ABt job (as expected -- it looks like they had too
much memory starvation and GC thrashing to run it efficiently), but
even there I am not sure whether that was before the performance
patches for the ABt job. It looks like that problem took them almost a
day to run through with -q 1 -- and again, that was mostly because of
the ABt multiplication. Extremely sparse inputs will cause more trouble
for ABt, whereas denser inputs are less prone to problems with q>0.

-d
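
For a sense of scale, here is a rough back-of-the-envelope comparison between the Halko experiment mentioned above and the roughly 300 MB input reported elsewhere in this thread. It is only a sketch: it assumes every one of the 726 splits is a full 512 MB (so the result is an upper bound), and the variable names are illustrative.

```python
MB = 1024 * 1024

# Upper bound on the Halko input: 726 splits, each assumed to be a full 512 MB.
halko_bytes = 726 * 512 * MB

# Input size reported in this thread ("around 300M bytes").
lei_bytes = 300_000_000

ratio = halko_bytes / lei_bytes
halko_gib = halko_bytes / 1024**3

print(f"Halko input: up to ~{halko_gib:.0f} GiB")
print(f"That is roughly {ratio:.0f}x the ~300 MB input discussed here")
```

Under these assumptions the dissertation problem is about three orders of magnitude larger than the input in question, which is why a stalled Q job on 300 MB points to a configuration problem rather than to the size of the data.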

On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Most importantly, what is your number of non-zero elements (or the
> input sequence file size)?

Re: SSVD too slow to handle large matrix?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

For that problem, like I said, just drop all the optional parameters and
run with the defaults.

Most importantly, give the child processes enough memory without hitting
swap. The Hadoop default used to be only 200m (I don't know about now).
That will surely cause GC thrashing and slow turnaround (if the job goes
through at all).
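
A minimal sketch of what such a run could look like, assuming the cluster accepts a per-job override of mapred.child.java.opts (the Hadoop 1.x property whose default is -Xmx200m), passed with -D in the same way the original command passes mapred.max.split.size. The helper name ssvd_command and the 1024m heap value are illustrative, and -k 10 is the small smoke-test rank suggested elsewhere in this thread. The code only assembles and prints the command line; it does not run anything.

```python
import shlex


def ssvd_command(input_path, output_path, rank, temp_dir, child_heap="1024m"):
    """Build a `mahout ssvd` command that leaves -p, -r, -t etc. at their defaults.

    child_heap is an assumed value; pick one that fits in physical RAM
    for the number of task slots per node.
    """
    return [
        "mahout", "ssvd",
        # Raise the per-task JVM heap above the old 200m default.
        f"-Dmapred.child.java.opts=-Xmx{child_heap}",
        "-i", input_path,
        "-o", output_path,
        "-k", str(rank),
        "--tempDir", temp_dir,
    ]


cmd = ssvd_command("mf/tr_full.seq", "mf/out_full", rank=10, temp_dir="mf/tmp")
print(shlex.join(cmd))
```

Starting from a small rank with everything else at defaults makes it easier to tell whether the slowness comes from memory starvation or from the size of k+p.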

On Fri, Sep 14, 2012 at 2:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Yeah, it sounds like something is wrong. 300 MB is not huge. I have
> problems with around 2G of input on 10 nodes and it doesn't take that
> long at all. Another researcher I knew was doing something similar
> with, I think, somewhere around 4-5B non-zeros.

Re: SSVD too slow to handle large matrix?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Yeah, it sounds like something is wrong. 300 MB is not huge. I have
problems with around 2G of input on 10 nodes and it doesn't take that
long at all. Another researcher I knew was doing something similar
with, I think, somewhere around 4-5B non-zeros.

On Fri, Sep 14, 2012 at 2:41 PM, lei tang <fi...@gmail.com> wrote:
> There are around 100M non-zero entries. The sequence file is not that
> huge, around 300 MB.
>
> I'll check out your other suggestions to see what is wrong.
>
> - Lei

Re: SSVD too slow to handle large matrix?

Posted by lei tang <fi...@gmail.com>.

There are around 100M non-zero entries. The sequence file is not that
huge, around 300 MB.

I'll check out your other suggestions to see what is wrong.

- Lei
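
To put these figures in context, here is a rough sketch of the sparsity of the matrix and of how many input splits the two split sizes would produce. It assumes the sizes quoted in this thread (30M x 1M, about 100M non-zeros, about 300 MB on disk) and a 64 MB default split, which is a guess made elsewhere in the thread. Real split counts also depend on HDFS block boundaries and the input format, so approx_splits is only a first-order estimate.

```python
import math

rows, cols = 30_000_000, 1_000_000   # "around 30M x 1M"
nonzeros = 100_000_000               # "around 100M non-zero entries"
file_bytes = 300_000_000             # "around 300M bytes"

density = nonzeros / (rows * cols)   # fraction of cells that are non-zero
nnz_per_row = nonzeros / rows        # average non-zeros per row


def approx_splits(total_bytes, split_bytes):
    """First-order split count: ceil(total / split), ignoring block boundaries."""
    return math.ceil(total_bytes / split_bytes)


# Split size from the original command: -Dmapred.max.split.size=1000000
splits_at_1mb = approx_splits(file_bytes, 1_000_000)
# Assumed 64 MB default split size.
splits_at_64mb = approx_splits(file_bytes, 64 * 1024 * 1024)

print(f"density         ~ {density:.2e}")
print(f"nnz per row     ~ {nnz_per_row:.1f}")
print(f"splits at 1 MB  ~ {splits_at_1mb}")
print(f"splits at 64 MB ~ {splits_at_64mb}")
```

Under these assumptions the matrix averages only about three non-zeros per row, and the 1 MB setting yields roughly 300 tiny splits where the default would yield about 5.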

On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Most importantly, what is your number of non-zero elements (or the
> input sequence file size)?

Re: SSVD too slow to handle large matrix?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Most importantly, what is your number of non-zero elements (or the
input sequence file size)?

On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> The Q job is actually the fastest one, and it is map-only. I'd say drop
> all the optional parameters (including p) and use Mahout 0.7.
>
> Actually, reducing the split size is unlikely to help. The default split
> should be fine.
>
> I'd say running -k 10 on input of any size should result in a Q mapper
> task running for at most a couple of minutes.
>
> Using -k 200 -p 100 is fairly ambitious (mapper task running time will
> scale a little worse than proportionally to k+p).
>
> If you use -q 1 you will likely have more problems with the ABt job, and
> that may require some memory tuning...
>
> Otherwise check the usual things -- memory, cluster capacity (do you
> actually have the capacity to run 100 mappers? Do they have at least 1G
> of RAM in -Xmx without touching swap? Are you seeing GC
> thrashing? etc.)
>
> That said, your problem doesn't seem too big (judging from 100 mappers
> with a regular split size, that should be ok). With -k 100 and the
> default p you should expect a single Q task to run for about 20-25
> minutes, depending on your hardware. It is CPU-bound (or rather, mostly
> FPU-bound, assuming you have tackled memory issues etc.)

Re: SSVD too slow to handle large matrix?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

The Q job is actually the fastest one, and it is map-only. I'd say drop
all the optional parameters (including p) and use Mahout 0.7.

Actually, reducing the split size is unlikely to help. The default split
should be fine.

I'd say running -k 10 on input of any size should result in a Q mapper
task running for at most a couple of minutes.

Using -k 200 -p 100 is fairly ambitious (mapper task running time will
scale a little worse than proportionally to k+p).

If you use -q 1 you will likely have more problems with the ABt job, and
that may require some memory tuning...

Otherwise check the usual things -- memory, cluster capacity (do you
actually have the capacity to run 100 mappers? Do they have at least 1G
of RAM in -Xmx without touching swap? Are you seeing GC
thrashing? etc.)

That said, your problem doesn't seem too big (judging from 100 mappers
with a regular split size, that should be ok). With -k 100 and the
default p you should expect a single Q task to run for about 20-25
minutes, depending on your hardware. It is CPU-bound (or rather, mostly
FPU-bound, assuming you have tackled memory issues etc.)
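
As a rough illustration of the k+p scaling remark, the sketch below extrapolates the 20-25 minute estimate for -k 100 to -k 200 -p 100. It assumes strictly linear scaling in k+p, so the result is a lower bound given that the real scaling is "a little worse" than proportional. It also assumes a default oversampling of p = 15, which is not stated in this thread, and it assumes the same split size in both cases. The function name scaled_minutes is illustrative.

```python
ASSUMED_DEFAULT_P = 15  # assumption: the default -p is not stated in this thread


def scaled_minutes(base_minutes, base_k, base_p, k, p):
    """Linear-in-(k+p) extrapolation of a single Q mapper task's running time."""
    return base_minutes * (k + p) / (base_k + base_p)


# Baseline from the message above: -k 100, default p, about 20-25 minutes per task.
low = scaled_minutes(20, 100, ASSUMED_DEFAULT_P, k=200, p=100)
high = scaled_minutes(25, 100, ASSUMED_DEFAULT_P, k=200, p=100)

print(f"-k 200 -p 100: at least ~{low:.0f}-{high:.0f} minutes per Q task")
```

With these assumptions k+p grows from 115 to 300, so each Q task would be expected to take somewhat more than 2.6 times as long as in the -k 100 case.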

On Fri, Sep 14, 2012 at 1:24 PM, lei tang <fi...@gmail.com> wrote:
> Hi,
>
> I am using Mahout's SSVD (stochastic SVD) to factorize a huge sparse
> matrix (around 30M x 1M). I used a modified version of the script at
> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
> to store the input matrix as <key, value> pairs, with an integer key and
> a VectorWritable value (in particular, a SequentialAccessSparseVector).
> Should I change to RandomAccessSparseVector?
>
> I managed to run Mahout SSVD with the following command:
> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
> mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>
> I specified the max split size in order to get more mappers. However, the
> first Q job does not seem to be making progress. After 1 hour, it is still
> at 12% with 100 mappers. Is this expected? Should I change any parameter?
>
> Any suggestion is highly appreciated.
>
> - Lei
> P.S.  I'm also reading the docs from
> https://issues.apache.org/jira/browse/MAHOUT-376  in hope that I can figure
> out why it is so slow.