Posted to user@mahout.apache.org by Ron Ayoub <ro...@live.com> on 2013/12/13 21:42:22 UTC

SSVD on very large and sparse matrices

I'm doing some up-front research on implementing LSI and on the choice of
tools. I understand Mahout provides an out-of-core implementation of
Stochastic SVD. On the web site it uses the words 'reasonable size
problems'. Would a sparse matrix of 1,000,000 x 1,000,000 with some
250,000,000 nonzero entries be out of the question? If so, what tools out
there can do that? For instance, ARPACK. Regardless, how does Mahout SSVD
compare to ARPACK? These seem to be the only options out there that I have
found. Thanks.
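
For concreteness, here is a rough back-of-the-envelope of the scale I have
in mind (my own numbers and plain SciPy, nothing Mahout-specific): 250M
nonzeros already need a few GB just to hold in a CSR layout, and the LSI
factors themselves are tall-and-skinny dense matrices.

# Illustrative only -- plain SciPy, not Mahout's SSVD.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Rough storage estimate for the matrix in question (CSR layout assumed).
rows, cols, nnz = 1_000_000, 1_000_000, 250_000_000
bytes_csr = nnz * (8 + 4) + (rows + 1) * 8   # values + column indices + row pointers
print(f"~{bytes_csr / 2**30:.1f} GiB just to hold the matrix")   # ~2.8 GiB

# Toy LSI: truncated SVD of a small random "term-document" matrix.
A = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
U, s, Vt = svds(A, k=50)                     # rank-50 factors
print(U.shape, s.shape, Vt.shape)            # (1000, 50) (50,) (50, 800)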

Re: SSVD on very large and sparse matrices

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: that's for p=100 "good" singular values. Of course k+p is the biggest
cost factor.
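
To put a number on that (a back-of-the-envelope of my own, not a
measurement): every pass over A multiplies the sparse matrix by a dense
block of width k+p, so the arithmetic per pass grows linearly with k+p.

# Rough per-run multiply-add count, assuming cost ~ nnz(A) * (k + p) per
# sparse-times-dense pass; the concrete numbers are illustrative only.
nnz = 250_000_000       # nonzeros from the original question
k_plus_p = 200          # e.g. ~100 requested singular values plus oversampling
q = 1                   # one power iteration

passes = 2 + 2 * q      # A*Omega, two passes per power iteration, Q^T*A at the end
flops = 2 * passes * nnz * k_plus_p
print(f"~{flops / 1e12:.2f} Tflop of multiply-adds, before any I/O or shuffle cost")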


On Fri, Dec 13, 2013 at 1:50 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS: if I am not mistaken, crunching Nathan's data, his largest experiment
> (wiki-all) used ~8B non-zero elements in a sparse matrix of geometry
> 37M x 38M, and it took him 22 hours to compute on his setup (4 EC2 large
> worker nodes?) with 1 power iteration (quite good accuracy), but analytical
> extrapolation to 16-32 nodes looks fairly good to me for a problem of that
> size. ~30 machines is not an extraordinary cluster by any measure today.
>
>
> On Fri, Dec 13, 2013 at 1:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>>
>>
>>
>> On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <ro...@live.com> wrote:
>>
>>> I'm doing some up-front research on implementing LSI and on the choice
>>> of tools. I understand Mahout provides an out-of-core implementation of
>>> Stochastic SVD. On the web site it uses the words 'reasonable size
>>> problems'. Would a sparse matrix of 1,000,000 x 1,000,000 with some
>>> 250,000,000 nonzero entries be out of the question?
>>
>>
>> For performance/accuracy assessment, Nathan's dissertation [1], pp. 139
>> and on, is so far the best source I know.
>>
>> Nathan compares performance and assesses bottlenecks on at least two
>> interesting data sets -- wiki and wiki-max. He found the bottleneck in
>> the matrix multiplication operation (but he may have done the testing
>> before certain improvements were applied to the matrix-matrix part of the
>> power iterations -- I am still hazy on that).
>>
>> [1]
>> http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
>>
>> I have great hope that this bottleneck could be further addressed by
>> punting MapReduce out of the equation and replacing it with Bagel or
>> GraphX broadcast operations in the upcoming Spark 0.9. I have plans to
>> address that in the Mahout-on-Spark part of the code, but I am still
>> waiting for the Spark project to rehash its graph-based computation
>> approach (there's a sense that GraphX should be superior in broadcasting
>> techniques to the existing Bagel API in Spark).
>>
>>
>>> If so, what tools out there can do that? For instance, ARPACK.
>>
>>
>> AFAIK nobody to date has cared to do a comparison with ARPACK.
>>
>>
>>> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be
>>> the only options out there that I have found. Thanks.
>>>
>>
>>
>>
>

Re: SSVD on very large and sparse matrices

Posted by Sebastian Schelter <ss...@googlemail.com>.
Sidenote: I was able to process a matrix of 3B non-zeros in 3 hours on a
6-machine cluster with Mahout's SSVD.

On 13.12.2013 22:50, Dmitriy Lyubimov wrote:
> PS: if I am not mistaken, crunching Nathan's data, his largest experiment
> (wiki-all) used ~8B non-zero elements in a sparse matrix of geometry
> 37M x 38M, and it took him 22 hours to compute on his setup (4 EC2 large
> worker nodes?) with 1 power iteration (quite good accuracy), but analytical
> extrapolation to 16-32 nodes looks fairly good to me for a problem of that
> size. ~30 machines is not an extraordinary cluster by any measure today.
> 
> 
> On Fri, Dec 13, 2013 at 1:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
>>
>>
>>
>> On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <ro...@live.com> wrote:
>>
>>> I'm doing some up-front research on implementing LSI and on the choice
>>> of tools. I understand Mahout provides an out-of-core implementation of
>>> Stochastic SVD. On the web site it uses the words 'reasonable size
>>> problems'. Would a sparse matrix of 1,000,000 x 1,000,000 with some
>>> 250,000,000 nonzero entries be out of the question?
>>
>>
>> For performance/accuracy assessment, Nathan's dissertation [1], pp. 139
>> and on, is so far the best source I know.
>>
>> Nathan compares performance and assesses bottlenecks on at least two
>> interesting data sets -- wiki and wiki-max. He found the bottleneck in
>> the matrix multiplication operation (but he may have done the testing
>> before certain improvements were applied to the matrix-matrix part of the
>> power iterations -- I am still hazy on that).
>>
>> [1]
>> http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
>>
>> I have great hope that this bottleneck could be further addressed by
>> punting MapReduce out of the equation and replacing it with Bagel or
>> GraphX broadcast operations in the upcoming Spark 0.9. I have plans to
>> address that in the Mahout-on-Spark part of the code, but I am still
>> waiting for the Spark project to rehash its graph-based computation
>> approach (there's a sense that GraphX should be superior in broadcasting
>> techniques to the existing Bagel API in Spark).
>>
>>
>>> If so, what tools out there can do that? For instance, ARPACK.
>>
>>
>> AFAIK nobody to date has cared to do a comparison with ARPACK.
>>
>>
>>> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be
>>> the only options out there that I have found. Thanks.
>>>
>>
>>
>>
> 


Re: SSVD on very large and sparse matrices

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: if I am not mistaken, crunching Nathan's data, his largest experiment
(wiki-all) used ~8B non-zero elements in a sparse matrix of geometry
37M x 38M, and it took him 22 hours to compute on his setup (4 EC2 large
worker nodes?) with 1 power iteration (quite good accuracy), but analytical
extrapolation to 16-32 nodes looks fairly good to me for a problem of that
size. ~30 machines is not an extraordinary cluster by any measure today.


On Fri, Dec 13, 2013 at 1:17 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
>
> On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <ro...@live.com> wrote:
>
>> I'm doing some up-front research on implementing LSI and on the choice
>> of tools. I understand Mahout provides an out-of-core implementation of
>> Stochastic SVD. On the web site it uses the words 'reasonable size
>> problems'. Would a sparse matrix of 1,000,000 x 1,000,000 with some
>> 250,000,000 nonzero entries be out of the question?
>
>
> For performance/accuracy assessment, Nathan's dissertation [1], pp. 139
> and on, is so far the best source I know.
>
> Nathan compares performance and assesses bottlenecks on at least two
> interesting data sets -- wiki and wiki-max. He found the bottleneck in
> the matrix multiplication operation (but he may have done the testing
> before certain improvements were applied to the matrix-matrix part of the
> power iterations -- I am still hazy on that).
>
> [1]
> http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
>
> I have great hope that this bottleneck could be further addressed by
> punting MapReduce out of the equation and replacing it with Bagel or
> GraphX broadcast operations in the upcoming Spark 0.9. I have plans to
> address that in the Mahout-on-Spark part of the code, but I am still
> waiting for the Spark project to rehash its graph-based computation
> approach (there's a sense that GraphX should be superior in broadcasting
> techniques to the existing Bagel API in Spark).
>
>
>> If so, what tools out there can do that? For instance, ARPACK.
>
>
> AFAIK nobody to date has cared to do a comparison with ARPACK.
>
>
>> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be
>> the only options out there that I have found. Thanks.
>>
>
>
>

Re: SSVD on very large and sparse matrices

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Dec 13, 2013 at 12:42 PM, Ron Ayoub <ro...@live.com> wrote:

> I'm doing some up-front research on implementing LSI and on the choice
> of tools. I understand Mahout provides an out-of-core implementation of
> Stochastic SVD. On the web site it uses the words 'reasonable size
> problems'. Would a sparse matrix of 1,000,000 x 1,000,000 with some
> 250,000,000 nonzero entries be out of the question?


For performance/accuracy assessment, Nathan's dissertation [1], pp. 139
and on, is so far the best source I know.

Nathan compares performance and assesses bottlenecks on at least two
interesting data sets -- wiki and wiki-max. He found the bottleneck in
the matrix multiplication operation (but he may have done the testing
before certain improvements were applied to the matrix-matrix part of the
power iterations -- I am still hazy on that).

[1]
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
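
To make the bottleneck concrete, here is a minimal single-machine sketch of
the randomized algorithm in plain NumPy/SciPy (illustrative only -- it is
not Mahout's distributed SSVD): each power iteration adds two more full
passes of sparse-times-dense multiplication over A, which is exactly the
step Nathan found dominant.

# Minimal randomized SVD with power iterations -- illustrative only,
# not Mahout's SSVD code. The products involving A are the expensive passes.
import numpy as np
import scipy.sparse as sp

def rsvd(A, k, p=15, q=1, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], k + p))
    Y = A @ omega                      # pass 1 over A
    for _ in range(q):                 # each power iteration adds two more passes
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)             # small, dense: m x (k+p)
    B = (A.T @ Q).T                    # equals Q^T * A; one more pass over A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

A = sp.random(2000, 1500, density=0.01, format="csr", random_state=1)
U, s, Vt = rsvd(A, k=50)
print(U.shape, s.shape, Vt.shape)      # (2000, 50) (50,) (50, 1500)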

I have great hope that this bottleneck could be further addressed by
punting MapReduce out of the equation and replacing it with Bagel or
GraphX broadcast operations in the upcoming Spark 0.9. I have plans to
address that in the Mahout-on-Spark part of the code, but I am still
waiting for the Spark project to rehash its graph-based computation
approach (there's a sense that GraphX should be superior in broadcasting
techniques to the existing Bagel API in Spark).
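
As a sketch of the broadcast idea (again my own illustration, written
against the plain Spark RDD API rather than Bagel or GraphX, with invented
names and layout): the dense block Omega is only n x (k+p), small enough to
ship to every worker, so each partition of A can compute its slice of
Y = A*Omega locally without shuffling the sparse matrix at all.

# Illustrative PySpark sketch of broadcasting the narrow dense block Omega
# so each partition of the sparse A computes its slice of Y = A * Omega
# locally. Not Mahout-on-Spark code; the row-block layout is an assumption.
import numpy as np
import scipy.sparse as sp
from pyspark import SparkContext

sc = SparkContext("local[2]", "ssvd-broadcast-sketch")

n_cols, k_plus_p = 1500, 65
omega_bc = sc.broadcast(np.random.standard_normal((n_cols, k_plus_p)))

# Pretend A is stored as (row_block_id, sparse row block) pairs.
blocks = [(i, sp.random(100, n_cols, density=0.01, format="csr", random_state=i))
          for i in range(10)]
A = sc.parallelize(blocks, numSlices=5)

# Each worker multiplies its local block by the broadcast Omega -- no shuffle of A.
Y = A.mapValues(lambda block: block @ omega_bc.value)
print(Y.first()[1].shape)   # (100, 65)
sc.stop()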


> If so, what tools out there can do that? For instance, ARPACK.


AFAIK nobody to date has cared to do a comparison with ARPACK.


> Regardless, how does Mahout SSVD compare to ARPACK? These seem to be
> the only options out there that I have found. Thanks.
>