Posted to dev@mahout.apache.org by "Shannon Quinn (Created) (JIRA)" <ji...@apache.org> on 2012/03/01 21:35:59 UTC

[jira] [Created] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

OutOfMemoryError in LanczosState by way of SpectralKMeans
---------------------------------------------------------

                 Key: MAHOUT-986
                 URL: https://issues.apache.org/jira/browse/MAHOUT-986
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.6
         Environment: Ubuntu 11.10 (64-bit)
            Reporter: Shannon Quinn
            Assignee: Shannon Quinn
            Priority: Minor
             Fix For: 0.7


Dan Brickley and I have been testing SpectralKMeans with a DBpedia dataset ( http://danbri.org/2012/spectral/dbpedia/ ); effectively, a graph with 4,192,499 nodes. Not surprisingly, the LanczosSolver throws an OutOfMemoryError when it attempts to instantiate a DenseMatrix of dimensions 4192499-by-4192499 (~17.5 trillion double-precision floating-point values). Here's the full stack trace:

{code}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:50)
	at org.apache.mahout.math.decomposer.lanczos.LanczosState.<init>(LanczosState.java:45)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:146)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.main(SpectralKMeansDriver.java:53)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{code}

Obviously SKM needs a more sustainable and memory-efficient way of performing an eigendecomposition of the graph Laplacian. For those more knowledgeable in Mahout's linear solvers than I am: can the Lanczos parameters be tweaked to obviate the need for a full DenseMatrix? Or should SKM move to SSVD instead?
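To see why the allocation cannot fit on any single heap, here is a quick back-of-the-envelope calculation (mine, not from the report; it assumes only the standard 8-byte size of a Java double):

```python
# Heap required for a dense n-by-n matrix of Java doubles (8 bytes each),
# for the 4,192,499-node DBpedia graph described above.
n = 4_192_499
entries = n * n             # ~1.76e13 matrix cells ("~17.5 trillion")
bytes_needed = entries * 8  # a Java double is 8 bytes
print(f"{entries:,} entries, ~{bytes_needed / 1e12:.0f} TB")
```

Roughly 140 TB of heap for one matrix, which no -Xmx setting will cover.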

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Status: Patch Available  (was: Open)


[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288250#comment-13288250 ] 

Grant Ingersoll commented on MAHOUT-986:
----------------------------------------

{code}bin/mahout spectralkmeans -k 20 -d 4192499 -x 7 -i path/to/csv/file/ -o your/output/path/{code}


[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Attachment: MAHOUT-986.patch


[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Attachment:     (was: MAHOUT-986.patch)


[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-986:
--------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288595#comment-13288595 ] 

Dan Brickley commented on MAHOUT-986:
-------------------------------------

Thanks for the fix. Testing it, I rediscovered a problem from an offlist thread w/ Shannon.

On my setup (a pseudo-cluster on one OS X laptop, right now), if there is an underscore ( _ ) in the input filename, I get this error:

2/06/04 07:21:52 INFO common.VectorCache: Loading vector from: spectral/output/calculations/diagonal/part-r-00000
Exception in thread "main" java.util.NoSuchElementException
	at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
	at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:115)

This (it seems to; I'm still waiting to confirm, but the run got further than before) works as a fix for that; however, as nobody else (including Shannon) seems to have hit this, I haven't raised a JIRA yet. The workaround was simply renaming the file:

grunt> cd spectral
grunt> cd input
grunt> ls
hdfs://localhost:9000/user/danbri/spectral/input/_topic_skm.csv<r 1>	494017428
grunt> mv _topic_skm.csv wikitopics.csv
grunt> quit
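A plausible explanation for the underscore problem (my reading, not confirmed in this thread): Hadoop's FileInputFormat treats any path whose basename starts with `_` or `.` as hidden (think `_SUCCESS`, `_logs`) and silently skips it, so an input named `_topic_skm.csv` would contribute no records at all, and a downstream reader like VectorCache.load would hit an empty iterator. A minimal sketch of that filter:

```python
# Sketch of Hadoop FileInputFormat's hidden-file convention: basenames
# starting with '_' or '.' are skipped, so _topic_skm.csv yields no input.
import os

def is_hidden(path: str) -> bool:
    name = os.path.basename(path)
    return name.startswith("_") or name.startswith(".")

visible = [p for p in ["_topic_skm.csv", "wikitopics.csv", "_SUCCESS"]
           if not is_hidden(p)]
print(visible)  # only wikitopics.csv survives the filter
```

Renaming the file out of the hidden namespace, as in the grunt session above, sidesteps the filter.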


I should also describe the test data: it's a bipartite graph of affinities between normal Wikipedia entries and the Wikipedia categories that apply to them. Each weight is boolean (0 or 1) and is (on Shannon's advice) expressed in both directions. So from the urldict.txt file, we see:

0       http://dbpedia.org/resource/Albedo
1       http://dbpedia.org/resource/Category:Electromagnetic_radiation
2       http://dbpedia.org/resource/Category:Climatology
3       http://dbpedia.org/resource/Category:Climate_forcing
4       http://dbpedia.org/resource/Category:Scattering,_absorption_and_radiative_transfer_(optics)
5       http://dbpedia.org/resource/Category:Radiometry


which maps to 

0,1,1.0
1,0,1.0
0,2,1.0
2,0,1.0
0,3,1.0
3,0,1.0
0,4,1.0
4,0,1.0
0,5,1.0
5,0,1.0

...which captures the associations between http://dbpedia.org/resource/Albedo and the 5 topics assigned to it in Wikipedia/dbpedia.
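The both-directions expansion described above can be sketched as follows (`symmetric_triples` is a hypothetical helper for illustration, not Mahout code):

```python
# Expand an undirected boolean edge list into the symmetric (row, col, weight)
# triples shown above: each edge (i, j) is emitted in both directions.
def symmetric_triples(edges):
    for i, j in edges:
        yield (i, j, 1.0)
        yield (j, i, 1.0)

# Albedo (node 0) linked to its five Wikipedia categories (nodes 1-5):
edges = [(0, k) for k in range(1, 6)]
rows = ["%d,%d,%.1f" % t for t in symmetric_triples(edges)]
print("\n".join(rows))
```

For the five Albedo edges this emits the ten CSV rows listed above, starting with `0,1,1.0` and `1,0,1.0`.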

Is this a good use of Mahout's spectral clustering, specifically? Hard to say ... we didn't get past the entry gate yet. Maybe a more symmetric link graph would've been a better first test.

See also http://mail-archives.apache.org/mod_mbox/mahout-user/201203.mbox/%3CCAFfrAFrFMNvvXEaFy0rSZT1tA=S6pCW-+YHrOhR=W5HGv_vP1Q@mail.gmail.com%3E
http://comments.gmane.org/gmane.comp.apache.mahout.user/12346 ... it would also be interesting to try pulling the second-smallest eigenvalue/vector from SSVD.


[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288240#comment-13288240 ] 

Hudson commented on MAHOUT-986:
-------------------------------

Integrated in Mahout-Quality #1517 (See [https://builds.apache.org/job/Mahout-Quality/1517/])
    MAHOUT-986 Remove old LDA implementation from codebase (Revision 1345736)

     Result = FAILURE
ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345736
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADocumentTopicMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDAInference.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDAReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDASampler.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDAState.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDAUtil.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDAWordTopicMapper.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/ClusteringTestUtils.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestLDAInference.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/lda/TestMapReduce.java
* /mahout/trunk/src/conf/driver.classes.props
* /mahout/trunk/src/conf/lda.props
* /mahout/trunk/src/conf/ldatopics.props

                
> OutOfMemoryError in LanczosState by way of SpectralKMeans
> ---------------------------------------------------------
>
>                 Key: MAHOUT-986
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-986
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: Ubuntu 11.10 (64-bit)
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>            Priority: Minor
>             Fix For: 0.7
>
>
> Dan Brickley and I have been testing SpectralKMeans with a dbpedia dataset ( http://danbri.org/2012/spectral/dbpedia/ ); effectively, a graph with 4,192,499 nodes. Not surprisingly, the LanczosSolver throws an OutOfMemoryError when it attempts to instantiate a DenseMatrix of dimensions 4192499-by-4192499 (~17.5 trillion double-precision floating point values). Here's the full stack trace:
> {quote}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:50)
> 	at org.apache.mahout.math.decomposer.lanczos.LanczosState.<init>(LanczosState.java:45)
> 	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:146)
> 	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> 	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.main(SpectralKMeansDriver.java:53)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:616)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:616)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {quote}
> Obviously SKM needs a more sustainable and memory-efficient way of performing an eigen-decomposition of the graph laplacian. For those who are more knowledgeable in the linear systems solvers of Mahout than I, can the Lanczos parameters be tweaked to negate the requirement of a full DenseMatrix? Or should SKM move to SSVD instead?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288650#comment-13288650 ] 

Hudson commented on MAHOUT-986:
-------------------------------

Integrated in Mahout-Quality #1525 (See [https://builds.apache.org/job/Mahout-Quality/1525/])
    MAHOUT-986 OutOfMemoryError in LanczosState by way of SpectralKMeans (Revision 1345978)

     Result = FAILURE
ssc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345978
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java

                

        

[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Attachment: MAHOUT-986.patch

Patch which implements Sebastian's fix.
                

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288310#comment-13288310 ] 

Hudson commented on MAHOUT-986:
-------------------------------

Integrated in Mahout-Quality #1521 (See [https://builds.apache.org/job/Mahout-Quality/1521/])
    MAHOUT-986 Put deprecated warning for lda and ldatopics (Revision 1345808)

     Result = FAILURE
robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345808
Files : 
* /mahout/trunk/src/conf/driver.classes.props

                

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288222#comment-13288222 ] 

Grant Ingersoll commented on MAHOUT-986:
----------------------------------------

Do we have a test case for this? What's the setup here? How much heap?
                

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288247#comment-13288247 ] 

Shannon Quinn commented on MAHOUT-986:
--------------------------------------

This issue was brought to my attention by Dan Brickley; this is his dataset from the dbpedia.org website. I believe it's a bunch of Wikipedia articles (4.19 million, to be exact) whose similarities to one another are weighted in one giant affinity matrix, and he was using spectral clustering to cluster the articles. I can contact him directly for a precise definition of the problem.
                

        

[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Description: 
Dan Brickley and I have been testing SpectralKMeans with a dbpedia dataset ( http://danbri.org/2012/spectral/dbpedia/ ); effectively, a graph with 4,192,499 nodes. Not surprisingly, the LanczosSolver throws an OutOfMemoryError when it attempts to instantiate a DenseMatrix of dimensions 4192499-by-4192499 (~17.5 trillion double-precision floating point values). Here's the full stack trace:

{quote}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:50)
	at org.apache.mahout.math.decomposer.lanczos.LanczosState.<init>(LanczosState.java:45)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:146)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.main(SpectralKMeansDriver.java:53)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{quote}

Obviously SKM needs a more sustainable and memory-efficient way of performing an eigen-decomposition of the graph laplacian. For those who are more knowledgeable in the linear systems solvers of Mahout than I, can the Lanczos parameters be tweaked to negate the requirement of a full DenseMatrix? Or should SKM move to SSVD instead?

  was:
Dan Brickley and I have been testing SpectralKMeans with a dbpedia dataset ( http://danbri.org/2012/spectral/dbpedia/ ); effectively, a graph with 4,192,499 nodes. Not surprisingly, the LanczosSolver throws an OutOfMemoryError when it attempts to instantiate a DenseMatrix of dimensions 4192499-by-4192499 (~17.5 trillion double-precision floating point values). Here's the full stack trace:

{{Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:50)
	at org.apache.mahout.math.decomposer.lanczos.LanczosState.<init>(LanczosState.java:45)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:146)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:86)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.main(SpectralKMeansDriver.java:53)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)}}

Obviously SKM needs a more sustainable and memory-efficient way of performing an eigen-decomposition of the graph laplacian. For those who are more knowledgeable in the linear systems solvers of Mahout than I, can the Lanczos parameters be tweaked to negate the requirement of a full DenseMatrix? Or should SKM move to SSVD instead?

    

        

[jira] [Updated] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-986:
---------------------------------

    Comment: was deleted

(was: Patch which implements Sebastian's fix.)
    

        

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288605#comment-13288605 ] 

Sebastian Schelter commented on MAHOUT-986:
-------------------------------------------

Might be a Hadoop thing; maybe they have special treatment for filenames starting with an underscore (don't they use this for _SUCCESS?).
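Hadoop's default input filter does indeed skip "hidden" paths whose names begin with '_' or '.' (e.g. _SUCCESS, _logs, checksum files). A plain-Java sketch of that convention, without the Hadoop dependency (class and method names are ours, not Hadoop's):

```java
// Mirrors Hadoop's hidden-file convention: names starting with '_' or '.'
// are skipped by the default input PathFilter. Sketch for illustration only.
public class HiddenPathFilter {
    public static boolean accept(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(accept("part-r-00000")); // regular output, accepted
        System.out.println(accept("_SUCCESS"));     // job marker, skipped
        System.out.println(accept(".part.crc"));    // checksum file, skipped
    }
}
```

So a Mahout output file named with a leading underscore would be silently ignored by a downstream job's input format, which could explain the behavior here.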
                

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288245#comment-13288245 ] 

Grant Ingersoll commented on MAHOUT-986:
----------------------------------------

I'm downloading the dataset, but it would be good to at least document what the test is.
                

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288243#comment-13288243 ] 

Shannon Quinn commented on MAHOUT-986:
--------------------------------------

Ahhh, excellent catch Sebastian. I suspect that's the problem.

The test case was on my own setup: 4 Ubuntu VMs, each with 4GB of memory (2GB heap, if I recall correctly), running a fully distributed Hadoop environment. I can't replicate that setup at the moment since I have not yet reassembled my computer after moving, but I've attached the patch that implements the fix Sebastian suggested. Given the parallel constructor signatures of LanczosState and DistributedLanczosSolver, I'm confident this is what we want.
                

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288198#comment-13288198 ] 

Sebastian Schelter commented on MAHOUT-986:
-------------------------------------------

I think that this is simply a typo in the code. The constructor of LanczosState expects the desiredRank (the rank of the low-rank approximation), yet it is invoked with the number of dimensions of the original data:

_LanczosState(VectorIterable corpus, int desiredRank, Vector initialVector)_

_LanczosState state = new LanczosState(L, numDims, solver.getInitialVector(L));_

Can you have a look at this, Shannon?
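To illustrate the mix-up with a minimal, self-contained sketch (class and field names are mine, not Mahout's): the state the Lanczos iteration actually needs scales with the requested rank, so the driver should pass its rank argument rather than numDims.

```java
// Hypothetical stand-in for LanczosState, to show the constructor mix-up.
// In Mahout the fix would be passing the desired rank instead of numDims
// at the call site in SpectralKMeansDriver.
public class LanczosStateSketch {
    final double[][] basis;   // one row per Lanczos iteration: desiredRank rows of numDims doubles

    LanczosStateSketch(int numDims, int desiredRank) {
        // Passing desiredRank = numDims here is what turns this into an
        // n-by-n dense allocation and triggers the OutOfMemoryError.
        basis = new double[desiredRank][numDims];
    }

    public static void main(String[] args) {
        int numDims = 10_000;    // toy dimensionality (the real graph has 4,192,499 nodes)
        int desiredRank = 10;    // hypothetical small rank, e.g. tied to the cluster count
        LanczosStateSketch state = new LanczosStateSketch(numDims, desiredRank);
        System.out.println(state.basis.length + " x " + state.basis[0].length);
        // prints: 10 x 10000
    }
}
```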


                

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288621#comment-13288621 ] 

Dan Brickley commented on MAHOUT-986:
-------------------------------------

Digging around a bit more, yes, seems to be a Hadoop convention. And waking up a bit, I realise I opened a JIRA to track this already: https://issues.apache.org/jira/browse/MAHOUT-978
                

[jira] [Commented] (MAHOUT-986) OutOfMemoryError in LanczosState by way of SpectralKMeans

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295660#comment-13295660 ] 

Dan Brickley commented on MAHOUT-986:
-------------------------------------

Does this Hudson comment mean we still have an issue?
                