Posted to dev@giraph.apache.org by "Avery Ching (Created) (JIRA)" <ji...@apache.org> on 2011/11/16 23:03:51 UTC

[jira] [Created] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings) 
-----------------------------------------------------------------------------------------------

                 Key: GIRAPH-91
                 URL: https://issues.apache.org/jira/browse/GIRAPH-91
             Project: Giraph
          Issue Type: Improvement
            Reporter: Avery Ching


Current vertex implementation uses a HashMap for storing the edges, which is quite memory heavy for large graphs.  The default settings in Giraph need to be improved for large graphs and heaps of >20G.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151793#comment-13151793 ] 

Avery Ching commented on GIRAPH-91:
-----------------------------------

Thanks for commenting, Claudio. However, I can't commit until you or another committer +1's it. =)
                
> Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings) 
> -----------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-91
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-91
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Avery Ching
>            Assignee: Avery Ching
>         Attachments: GIRAPH-91.diff
>
>
> Current vertex implementation uses a HashMap for storing the edges, which is quite memory heavy for large graphs.  The default settings in Giraph need to be improved for large graphs and heaps of >20G.


        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Claudio Martella (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151643#comment-13151643 ] 

Claudio Martella commented on GIRAPH-91:
----------------------------------------

The List-based adjacency list looks quite good to me. A couple of weeks ago I ran a microbenchmark of iteration performance over ArrayList/array, TreeMap, HashMap and SkipList, and I was quite impressed by the performance hit. I believe we save not only memory here (I would be curious to calculate the overhead precisely) but also time in algorithms such as PR, where compute has an iterator-based sendMsg pattern. Good!
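
A rough sketch of the kind of iteration comparison mentioned above, assuming a simple sum loop stands in for the per-edge sendMsg pattern. This is not the microbenchmark that was actually run; the class and method names are hypothetical, and a single System.nanoTime() measurement like this is only indicative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical iteration comparison; not the benchmark described in the thread. */
final class EdgeIterationSketch {
    /** Sums all values by iterating a list, standing in for a per-edge send loop. */
    static long sumList(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Sums all values by iterating a map's entry set. */
    static long sumMap(Map<Long, Long> edges) {
        long sum = 0;
        for (Map.Entry<Long, Long> e : edges.entrySet()) {
            sum += e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Long> list = new ArrayList<>(n);
        Map<Long, Long> map = new HashMap<>();
        for (long i = 0; i < n; i++) {
            list.add(i);
            map.put(i, i);
        }
        // Warm up so the JIT has a chance to compile both loops before timing.
        for (int i = 0; i < 10; i++) {
            sumList(list);
            sumMap(map);
        }
        long t0 = System.nanoTime();
        long a = sumList(list);
        long t1 = System.nanoTime();
        long b = sumMap(map);
        long t2 = System.nanoTime();
        System.out.println("list: " + (t1 - t0) / 1_000 + " us, sum=" + a);
        System.out.println("map:  " + (t2 - t1) / 1_000 + " us, sum=" + b);
    }
}
```

For numbers worth acting on, a harness that handles warm-up and dead-code elimination would be a better choice than hand-rolled timing.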
                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151587#comment-13151587 ] 

jiraposter@reviews.apache.org commented on GIRAPH-91:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2868/
-----------------------------------------------------------

Review request for giraph.


Summary
-------

These general changes should support larger heap sizes (i.e. >20G).

- Added new EdgeListVertex that stores its edges in a compact pair of lists instead of Vertex's HashMap.

- Added unit tests (TestEdgeListVertex) to test EdgeListVertex.

- Augmented PageRankBenchmark to choose between EdgeListVertex and Vertex (to try it out).

- Added failure cleanup so that failed workers quickly alert the master that they are dead by deleting their health ephemeral znodes.  This allows us to set higher ZooKeeper timeouts to deal with GC pauses and the like.  In a quick test of 3 nodes, I saw failure in 43 seconds instead of 1 min 52 sec.

- Added a context.progress() call to flushing so that jobs are not killed during long pauses (GC or lots of messages).
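
To illustrate the "compact pair of lists" idea, here is a minimal sketch, assuming the two lists are kept sorted by target vertex id so that lookups can use binary search. It is not the EdgeListVertex code from the patch; the class name, the long/double types and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: edges stored as two parallel lists sorted by target id. */
final class SortedEdgeLists {
    private final List<Long> targetIds = new ArrayList<>();
    private final List<Double> edgeValues = new ArrayList<>();

    /** Inserts or overwrites an edge; returns true if the edge was new. */
    boolean addEdge(long targetId, double value) {
        int pos = Collections.binarySearch(targetIds, targetId);
        if (pos >= 0) {
            edgeValues.set(pos, value);
            return false;
        }
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        int insertAt = -(pos + 1);
        targetIds.add(insertAt, targetId);
        edgeValues.add(insertAt, value);
        return true;
    }

    /** O(log n) membership test on the sorted id list. */
    boolean hasEdge(long targetId) {
        return Collections.binarySearch(targetIds, targetId) >= 0;
    }

    /** Returns the edge value, or null if there is no such edge. */
    Double getEdgeValue(long targetId) {
        int pos = Collections.binarySearch(targetIds, targetId);
        return pos >= 0 ? edgeValues.get(pos) : null;
    }

    int getNumOutEdges() {
        return targetIds.size();
    }

    /** Read-only view of the target ids in ascending order. */
    List<Long> targetIds() {
        return Collections.unmodifiableList(targetIds);
    }
}
```

The trade-off in this sketch is that inserts are O(n) because later elements shift, while the per-edge overhead avoids a HashMap's entry objects and bucket array.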


This addresses bug GIRAPH-91.
    https://issues.apache.org/jira/browse/GIRAPH-91


Diffs
-----

  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/EdgeListVertex.java PRE-CREATION 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/TestJsonBase64Format.java 1202898 
  http://svn.apache.org/repos/asf/incubator/giraph/trunk/src/test/java/org/apache/giraph/graph/TestEdgeListVertex.java PRE-CREATION 

Diff: https://reviews.apache.org/r/2868/diff


Testing
-------

Local unit tests, PageRankBenchmark on multiple machines with >20GB heaps.


Thanks,

Avery


                

        

[jira] [Updated] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching updated GIRAPH-91:
------------------------------

    Attachment: GIRAPH-91.diff

Sure, sorry about that.
                

        

[jira] [Assigned] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching reassigned GIRAPH-91:
---------------------------------

    Assignee: Avery Ching
    

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151844#comment-13151844 ] 

Avery Ching commented on GIRAPH-91:
-----------------------------------

Arun, we can certainly try other data structures for other BasicVertex implementations.  This one is meant for a pretty decent memory reduction.  I expect we will have a bunch of different implementations based on the requirements of the application.
                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Jakob Homan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151600#comment-13151600 ] 

Jakob Homan commented on GIRAPH-91:
-----------------------------------

Can you attach the patch to the JIRA, for non-RB review?
                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Claudio Martella (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151946#comment-13151946 ] 

Claudio Martella commented on GIRAPH-91:
----------------------------------------

+1 for me. :)
                

        

[jira] [Resolved] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching resolved GIRAPH-91.
-------------------------------

    Resolution: Fixed

Thanks for the quick review, Claudio.  Hudson +1'ed it as well, so I am resolving this.
                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151965#comment-13151965 ] 

Hudson commented on GIRAPH-91:
------------------------------

Integrated in Giraph-trunk-Commit #36 (See [https://builds.apache.org/job/Giraph-trunk-Commit/36/])
    GIRAPH-91: Large-memory improvements (Memory reduced vertex
implementation, fast failure, added settings). (aching)

aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1203130
Files : 
* /incubator/giraph/trunk/CHANGELOG
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/EdgeListVertex.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestJsonBase64Format.java
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/graph
* /incubator/giraph/trunk/src/test/java/org/apache/giraph/graph/TestEdgeListVertex.java

                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Arun Suresh (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151823#comment-13151823 ] 

Arun Suresh commented on GIRAPH-91:
-----------------------------------

Avery, I see that you have used two sorted ArrayLists. Couldn't a LinkedHashMap have been an alternative? I understand that getEdgeValue and hasEdgeValue would be faster with a sorted ArrayList, and that ArrayLists are more compact. But I was just wondering: if the graph is truly large (millions of edges for a vertex), would it make sense to keep the entire edge list in memory in the first place? We might need a scheme where only part of the list is in memory and chunks of the list are fetched on demand when the provided iterator calls next(). In that case we could have a hybrid array + linked list (a linked list of chunks of the edge list).
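
The on-demand chunk idea could look roughly like the sketch below. It assumes a hypothetical loader function that returns chunk number i as an array, or null once there are no more chunks, and it uses that chunk index in place of explicit linked-list nodes. Nothing here is part of the patch; all names are illustrative.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical sketch: iterate target ids chunk by chunk, loading chunks on demand. */
final class ChunkedEdgeIterator implements Iterator<Long> {
    /** Assumed loader: returns chunk i, or null when no chunks remain. */
    private final IntFunction<long[]> chunkLoader;
    private long[] current = new long[0];
    private int nextChunk = 0;
    private int posInChunk = 0;
    private boolean exhausted = false;

    ChunkedEdgeIterator(IntFunction<long[]> chunkLoader) {
        this.chunkLoader = chunkLoader;
    }

    @Override
    public boolean hasNext() {
        // Load the next chunk only when the current one is used up, skipping
        // empty chunks; at most one chunk is referenced at a time.
        while (!exhausted && posInChunk >= current.length) {
            long[] loaded = chunkLoader.apply(nextChunk++);
            if (loaded == null) {
                exhausted = true;
                current = new long[0];
            } else {
                current = loaded;
            }
            posInChunk = 0;
        }
        return !exhausted;
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current[posInChunk++];
    }
}
```

In a real setting the loader would presumably read from disk or another store; random access such as getEdgeValue would then need a separate index, which this sketch does not attempt.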
                

        

[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151603#comment-13151603 ] 

Avery Ching commented on GIRAPH-91:
-----------------------------------

By the way, rb allows you to download the diff directly (so you don't have to worry about them staying in sync).

https://reviews.apache.org/r/2868/diff/raw/
                