Posted to user@giraph.apache.org by Eli Reisman <ap...@gmail.com> on 2013/12/08 19:55:51 UTC

Re: vertex and data block co-location

Running Giraph on MapReduce, you have no control over where the worker
tasks will be hosted on the cluster. Therefore the partitioning generally
is not aware of co-located blocks and does a fair amount of time-consuming
network shuffling of data during the initialization of a Giraph job.

What Giraph does do is this: as each worker task spins up on the cluster,
it attempts to claim input splits that happen to be local to the DataNode
the worker runs on. This speeds up the initial ingestion of graph data
quite a bit, but does not help much when it comes to distributing the data
to the worker that owns that data's assigned partition.
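
To make the split-claiming idea concrete, here is a rough sketch in Python.
It is not Giraph's actual code: the function name claim_split, the
split_hosts mapping and the host names are all made up for illustration. It
only shows the "prefer a local split, otherwise take any remaining one"
behaviour described above.

```python
from typing import Dict, List, Optional, Set


def claim_split(worker_host: str,
                split_hosts: Dict[str, List[str]],
                claimed: Set[str]) -> Optional[str]:
    """Claim one unclaimed split, preferring one stored on worker_host.

    split_hosts maps a (hypothetical) split name to the hosts that hold
    a replica of its data block. Returns None when nothing is left.
    """
    unclaimed = [s for s in sorted(split_hosts) if s not in claimed]
    if not unclaimed:
        return None
    # First pass: look for a split with a replica on this worker's host.
    for split in unclaimed:
        if worker_host in split_hosts[split]:
            claimed.add(split)
            return split
    # Fallback: no local split is left, so take any remaining one
    # (this one will be read over the network).
    split = unclaimed[0]
    claimed.add(split)
    return split
```

Note that this only decides who reads a split. It says nothing about which
worker ends up owning the vertices that were read.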

Only when all data has been pushed to the appropriate worker can the
Giraph job actually begin. When data does end up belonging to a
host-local partition it is not sent over the network, but in many cases
the transfer cannot be avoided without using an alternative to hash
partitioning.
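
As a rough illustration of why hash partitioning ignores where a vertex was
read, here is a small Python sketch. It is a simplified stand-in, not
Giraph's implementation: it assumes integer vertex ids, uses a plain modulo
in place of the real hash, and the names partition_of, owner_of and
local_fraction are hypothetical.

```python
from typing import Dict, List


def partition_of(vertex_id: int, num_partitions: int) -> int:
    # Assumption: a plain modulo stands in for hashing the vertex id.
    return vertex_id % num_partitions


def owner_of(partition: int, num_workers: int) -> int:
    # Assumption: partitions are assigned to workers round-robin.
    return partition % num_workers


def local_fraction(vertices_read_by: Dict[int, List[int]],
                   num_partitions: int,
                   num_workers: int) -> float:
    """Fraction of vertices whose reader is also their owner.

    vertices_read_by maps a worker index to the vertex ids that worker
    loaded from its input splits.
    """
    total = 0
    local = 0
    for reader, vertex_ids in vertices_read_by.items():
        for vertex_id in vertex_ids:
            total += 1
            partition = partition_of(vertex_id, num_partitions)
            if owner_of(partition, num_workers) == reader:
                local += 1  # stays on the host, no network transfer
    return local / total if total else 0.0
```

In this sketch the owner depends only on the vertex id, so two vertices
read from the same block (or the same DataNode) land on the same worker
only by coincidence.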

On Sat, Nov 16, 2013 at 12:22 PM, David J Garcia <dj...@utexas.edu> wrote:

> hello, I was wondering if there was a way to ensure that vertices located
> on the same data block (on hdfs) are co-located with each other?
>
> Also, will the vertices in input-splits (splits that are located on the
> same DataNode) have a reasonable chance of being partitioned to the same id?
>
> for example, suppose that I have vertex_1 located on data_block_i, and
> vertex_2 located on data_block_k.  Let's suppose that both of the data
> blocks are located on the same DataNode machine.  Is there a reasonably
> good chance that the vertex_1 and vertex_2 will partition to the same id?
>
> I'm doing a research project and I'm trying to show the benefits of graph
> data-locality.
>
> -David
>

RE: vertex and data block co-location

Posted by Pavan Kumar A <pa...@outlook.com>.

@David
You can have a look at
http://researcher.watson.ibm.com/researcher/files/us-ytian/giraph++.pdf
This work was done by
http://researcher.watson.ibm.com/researcher/view.php?person=us-ytian
In it she describes alternative partitioning schemes that she implemented on top of Giraph and shows the resulting optimizations, taking some graph algorithms as examples.