You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@giraph.apache.org by "Claudio Martella (JIRA)" <ji...@apache.org> on 2012/11/21 16:26:01 UTC

[jira] [Commented] (GIRAPH-436) Improvement of partition selection for out-of-core graph

    [ https://issues.apache.org/jira/browse/GIRAPH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502040#comment-13502040 ] 

Claudio Martella commented on GIRAPH-436:
-----------------------------------------

Alessandro, as you worked on the original GIRAPH-249, do you have any comments?
                
> Improvement of partition selection for out-of-core graph
> --------------------------------------------------------
>
>                 Key: GIRAPH-436
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-436
>             Project: Giraph
>          Issue Type: Improvement
>          Components: graph
>    Affects Versions: 0.2.0
>            Reporter: Claudio Martella
>            Priority: Minor
>
> Currently, the OOC graph component needs a parameter K that describes how many partitions should be kept in memory out of N (the number of partitions assigned to that worker). Hence N - K partitions are constantly scanned and from/to disk.
> Now, as discussed on GIRAPH-249, for algorithms were the all the vertices are NOT always active, e.g. SSSP, this can have an impact on performance, as we might get to (a) scan a partition on disk with very few active vertices (b) scan a partition on disk with NO active vertices (c) scan a partition on disk with all active vertices while we have some partitions with NO active vertices in memory. The impact of OOC graph on other algorithms, e.g. PR-like algorithms, is less important, as the cost of hitting the disk is completely amortized (each IO byte is actually computed as all the vertices are active) and efficient (scan-based).
> For the affected class of algorithms, two contributions are possible:
> 1) add a header to the partition file with the stats (a boolean hasActive field) about active vertices in the partition. If that partition hasn't received messages AND all the vertices are inactive, it can be skipped completely. Solves for problem (a). Cost: one random IO to the header at the end of each OOC partition computation.
> 2) greedy heuristic that tries to keep in memory only K partitions, when possible, that have hasActive set to true. Once an inactive partition is spilled to disk, and it stays to disk, it won't be scanned/read back to memory unless necessary (works with contribution (1)). This accounts for problem (c).
> As far as (b) is concerned, there's not much one can do without breaking linear scanning.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira