Posted to dev@mahout.apache.org by "Nathan Halko (Commented) (JIRA)" <ji...@apache.org> on 2011/12/30 08:57:30 UTC

[jira] [Commented] (MAHOUT-308) Improve Lanczos to handle extremely large feature sets (without hashing)

    [ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177583#comment-13177583 ] 

Nathan Halko commented on MAHOUT-308:
-------------------------------------

Not sure if this is the right place for this, but I have been searching elsewhere without success.

Setup: finding 150 singular values for a 1e6 x 1e6 matrix, super sparse (2G on disk).

I'm getting Java heap errors using the Lanczos SVD in 0.6-SNAPSHOT.  The way I interpret the code, specifying a --workingDir uses HdfsBackedLanczosState, which stores each basis vector in DFS (and I can see that they live there).  When the vectors are needed (for orthogonalization and for projecting the eigenvectors), they appear to be read from disk one by one, with only a few dense vectors in memory at any one time (the current vector, basis vector i, and an accumulation vector).  This should have very light memory requirements and hammer the network instead; however, I'm not seeing this behavior.

Is this a known issue, perhaps a memory leak?  Is there something behind the scenes that keeps these vectors in memory?  I can't make sense of the error below.
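For reference, here is a minimal sketch of the streaming access pattern I would expect: orthogonalize the current vector against each persisted basis vector one at a time, so only two dense vectors (plus the accumulator implicit in the subtraction) are resident at once.  The class and method names here are illustrative stand-ins, not Mahout's actual API, and a plain in-memory list stands in for the HDFS-backed SequenceFile store.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingOrthogonalization {

  static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  /**
   * Classical Gram-Schmidt step against a streamed basis: each stored
   * vector is visited once, so peak memory is O(2 * numColumns) doubles
   * regardless of how many basis vectors exist in the store.
   */
  static void orthogonalizeAgainstBasis(double[] current, Iterable<double[]> basisStore) {
    for (double[] b : basisStore) {  // one basis vector in memory at a time
      double proj = dot(current, b) / dot(b, b);
      for (int i = 0; i < current.length; i++) {
        current[i] -= proj * b[i];
      }
    }
  }

  public static void main(String[] args) {
    List<double[]> basis = new ArrayList<>();
    basis.add(new double[] {1, 0, 0});
    basis.add(new double[] {0, 1, 0});
    double[] v = {3, 4, 5};
    orthogonalizeAgainstBasis(v, basis);
    // v is now orthogonal to both basis vectors: (0, 0, 5)
    System.out.println(v[0] + " " + v[1] + " " + v[2]);
  }
}
```

If the solver really worked this way, the driver heap would stay roughly constant as the basis grows, which is what makes the OOM below surprising.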


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.lang.Object.clone(Native Method)
	at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
	at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
	at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:99)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
	at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:151)
	at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
	at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
	at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
	at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
	at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:123)
	at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.5
>
>         Attachments: MAHOUT-308.patch
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) computer while Hadoop is iterating.  The memory requirement for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which for a desiredRank of a few hundred caps out usefulness at a few million columns on most commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the Lanczos basis to disk, except for the most recent two vectors.  Some care must be taken with the "orthogonalizeAgainstBasis()" method call, which uses the entire basis; that part would be slower this way.
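Plugging the numbers from the comment above into that formula (rank 150, 1e6 columns) shows an in-memory basis alone would already need over a gigabyte of driver heap, which is consistent with the reported OOM if the basis is not actually being spilled to disk.  This is a quick back-of-the-envelope check, not anything from the Mahout codebase:

```java
public class LanczosMemoryEstimate {

  /**
   * Driver-side bytes needed to hold the full Lanczos basis in memory,
   * per the formula in the issue: desiredRank * numColumns * 8 bytes
   * per double.
   */
  static long basisBytes(long desiredRank, long numColumns) {
    return desiredRank * numColumns * 8L;
  }

  public static void main(String[] args) {
    // Setup from the comment: 150 singular values, 1e6 columns.
    long bytes = basisBytes(150, 1_000_000L);
    System.out.printf("basis alone: %.2f GB%n", bytes / 1e9);  // 1.20 GB
  }
}
```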

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira