Posted to dev@giraph.apache.org by "Hassan Eslami (JIRA)" <ji...@apache.org> on 2016/06/15 18:28:09 UTC

[jira] [Commented] (GIRAPH-1073) Decouple out-of-core persistence infrastructure from out-of-core computation

    [ https://issues.apache.org/jira/browse/GIRAPH-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332260#comment-15332260 ] 

Hassan Eslami commented on GIRAPH-1073:
---------------------------------------

https://reviews.facebook.net/D59691

> Decouple out-of-core persistence infrastructure from out-of-core computation
> ----------------------------------------------------------------------------
>
>                 Key: GIRAPH-1073
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1073
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Hassan Eslami
>            Assignee: Hassan Eslami
>
> In the current out-of-core infrastructure, the persistence layer is heavily intertwined with the scheduling and out-of-core engine. This makes it complicated to try new features for the persistence layer. The following changes are needed:
>  * The persistence layer should be decoupled from the out-of-core infrastructure. This way one can simply implement and plug in different data accessors for various persistence resources, e.g. local file system data accessor, HDFS data accessor, serialized in-memory data accessor, etc.
>  * We should be able to address out-of-core data in a more efficient and flexible way. Currently, data are accessed/addressed through string literals in various locations of the code. This should be changed so data can be accessed through a unified, more flexible data indexing mechanism.
>  * With different implementations of the data accessor, there may be more emphasis on having more IO threads. It is important that these IO threads are load-balanced. Currently, partitions are assigned to IO threads using a hash function. Hash functions tend not to balance load well with a small number of data points (partitions in this case).
>  * Currently, out-of-core uses `BufferedInputStream` and `BufferedOutputStream` along with the default (de)serialization mechanism. The IO bandwidth achieved in the current implementation is low. Two simple improvements are possible: 1) an Unsafe (de)serialization mechanism to optimize for memory bandwidth during the (de)serialization process, 2) `RandomAccessFile`'s read and write interface to get lower-level access to the local file system and avoid overheads in reading/writing from/to local files.
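
To illustrate the load-balancing point in the third bullet, here is a minimal sketch and not the actual Giraph code. The class and method names (IoThreadAssignment, hashAssign, roundRobinAssign) are hypothetical, and the partition IDs are contrived so that they share a stride, which is the kind of input where hash-modulo assignment can skew when there are only a few partitions. Round-robin is used only as one possible alternative; the issue does not say which scheme replaces the hash function.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from the Giraph code base. */
public class IoThreadAssignment {

  /** Hash-style assignment: the thread is picked from the partition ID's hash. */
  static int[] hashAssign(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int id : partitionIds) {
      // Integer.hashCode(int) returns the value itself; floorMod keeps the index non-negative.
      load[Math.floorMod(Integer.hashCode(id), numThreads)]++;
    }
    return load;
  }

  /** Round-robin assignment: the i-th partition in the list goes to thread i mod numThreads. */
  static int[] roundRobinAssign(int[] partitionIds, int numThreads) {
    int[] load = new int[numThreads];
    for (int i = 0; i < partitionIds.length; i++) {
      load[i % numThreads]++;
    }
    return load;
  }

  public static void main(String[] args) {
    // Assumed input: six partitions whose IDs are all multiples of the thread count.
    int[] partitionIds = {0, 4, 8, 12, 16, 20};
    int numThreads = 4;
    System.out.println("hash:        " + Arrays.toString(hashAssign(partitionIds, numThreads)));
    System.out.println("round-robin: " + Arrays.toString(roundRobinAssign(partitionIds, numThreads)));
  }
}
```

With this contrived input, the hash-based scheme places all six partitions on one IO thread, while round-robin spreads them so that no thread holds more than one partition above any other.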



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)