Posted to user@giraph.apache.org by Martin Junghanns <ma...@gmx.net> on 2014/11/04 08:36:16 UTC

Graph partitioning and data locality

Hi group,

I have a question concerning the graph partitioning step. If I understood
the code correctly, the graph is distributed to n partitions using
vertexID.hashCode() & n. I have two questions concerning that step.
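
For concreteness, here is a minimal, self-contained Java sketch of the
assignment described above; it is illustrative only, not Giraph's actual
code, and it already uses the % operator that the question is corrected
to later in the thread:

public final class HashPartitionSketch {
    /** Map a vertex id to one of n partitions via its hashCode. */
    static int partitionFor(Object vertexId, int n) {
        // Math.abs guards against negative hash codes (a production
        // version would also handle Integer.MIN_VALUE, whose absolute
        // value overflows).
        return Math.abs(vertexId.hashCode()) % n;
    }

    public static void main(String[] args) {
        // Long.valueOf(42L).hashCode() is 42, so this prints 2 for n = 8.
        System.out.println(partitionFor(Long.valueOf(42L), 8));
    }
}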

1) Is the whole graph loaded and partitioned by the Master alone? That
would mean all the data has to be moved to the Master's map task and
then on to the physical node where the worker responsible for a given
partition runs. As this sounds like a huge overhead, I inspected the
code further:
I saw that there is also a WorkerGraphPartitioner, and I assume it calls
the partitioning method on its local data (let's say its local HDFS
blocks); if the resulting partition for a vertex is not local, the data
gets moved to the owning worker, which reduces the overhead. Is this
assumption correct?

2) Let's say the graph is already partitioned in the file system, e.g.
blocks on physical nodes contain logically connected graph nodes. Is it
possible to just read the data as it is and skip the partitioning step?
In that case, I currently assume the vertexID should contain the
partitionID, and the custom partitioning would be an identity function
(instead of hashing or ranges).
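
A rough sketch of the identity scheme proposed in 2), under the made-up
assumption that the partition id sits in the upper 16 bits of a 64-bit
vertex id; the class name and encoding are illustrative, not Giraph API:

public final class IdentityPartitionSketch {
    /** Extract the pre-assigned partition id instead of hashing. */
    static int partitionFor(long vertexId, int numPartitions) {
        int encoded = (int) (vertexId >>> 48);  // assumed id layout
        // Fall back to hashing if the encoded id is out of range.
        return encoded < numPartitions
                ? encoded
                : Math.abs(Long.hashCode(vertexId)) % numPartitions;
    }
}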

Thanks for your time and help!

Cheers,
Martin

Re: Graph partitioning and data locality

Posted by Martin Junghanns <ma...@gmx.net>.
Thanks to you both, and to Lukas, who answered me directly. The
referenced issue GIRAPH-908 helps a lot; I will look further into it and
into the physical layout.

On 04.11.2014 16:28, Pavan Kumar A wrote:
> You can also look at https://issues.apache.org/jira/browse/GIRAPH-908,
> which solves the case where you have a partition map and would like the
> graph to be partitioned that way after loading the input. It does not,
> however, solve the "do not shuffle the data" part.
>
> ------------------------------------------------------------------------
> From: claudio.martella@gmail.com
> Date: Tue, 4 Nov 2014 16:20:21 +0100
> Subject: Re: Graph partitioning and data locality
> To: user@giraph.apache.org
>
> Hi,
>
> answers are inline.
>
> On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns 
> <martin.junghanns@gmx.net <ma...@gmx.net>> wrote:
>
>     Hi group,
>
>     I have a question concerning the graph partitioning step. If I
>     understood the code correctly, the graph is distributed to n
>     partitions using vertexID.hashCode() & n. I have two questions
>     concerning that step.
>
>     1) Is the whole graph loaded and partitioned by the Master alone?
>     That would mean all the data has to be moved to the Master's map
>     task and then on to the physical node where the worker responsible
>     for a given partition runs. As this sounds like a huge overhead, I
>     inspected the code further:
>     I saw that there is also a WorkerGraphPartitioner, and I assume it
>     calls the partitioning method on its local data (let's say its
>     local HDFS blocks); if the resulting partition for a vertex is not
>     local, the data gets moved to the owning worker, which reduces the
>     overhead. Is this assumption correct?
>
>
> That is correct: workers forward vertex data to the worker that is
> responsible for that vertex, via hash partitioning by default, meaning
> that the master is not involved.
>
>
>     2) Let's say the graph is already partitioned in the file system,
>     e.g. blocks on physical nodes contain logically connected graph
>     nodes. Is it possible to just read the data as it is and skip the
>     partitioning step? In that case, I currently assume the vertexID
>     should contain the partitionID, and the custom partitioning would
>     be an identity function (instead of hashing or ranges).
>
>
> In principle you can. You would need to organize the splits so that
> each contains all the data for one particular worker, and then assign
> the relevant splits to the corresponding worker.
>
>
>     Thanks for your time and help!
>
>     Cheers,
>     Martin
>
>
>
>
> -- 
>    Claudio Martella


RE: Graph partitioning and data locality

Posted by Pavan Kumar A <pa...@outlook.com>.
You can also look at https://issues.apache.org/jira/browse/GIRAPH-908,
which solves the case where you have a partition map and would like the
graph to be partitioned that way after loading the input. It does not,
however, solve the "do not shuffle the data" part.
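
To illustrate the partition-map idea, here is a hypothetical lookup
table that is consulted instead of hashing. This is not GIRAPH-908's
actual implementation (see the JIRA for that); every name below is made
up:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public final class PartitionMapSketch {
    private final Map<Long, Integer> vertexToPartition = new HashMap<>();

    /** Load lines of the form "<vertexId><TAB><partitionId>". */
    void load(String path) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t");
                vertexToPartition.put(Long.parseLong(parts[0]),
                                      Integer.parseInt(parts[1]));
            }
        }
    }

    /** Unmapped vertices fall back to plain hash partitioning. */
    int partitionFor(long vertexId, int numPartitions) {
        Integer p = vertexToPartition.get(vertexId);
        return p != null ? p : Math.abs(Long.hashCode(vertexId)) % numPartitions;
    }
}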

From: claudio.martella@gmail.com
Date: Tue, 4 Nov 2014 16:20:21 +0100
Subject: Re: Graph partitioning and data locality
To: user@giraph.apache.org

Hi,
answers are inline.
On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns <ma...@gmx.net> wrote:
Hi group,

I have a question concerning the graph partitioning step. If I
understood the code correctly, the graph is distributed to n partitions
using vertexID.hashCode() & n. I have two questions concerning that
step.

1) Is the whole graph loaded and partitioned by the Master alone? That
would mean all the data has to be moved to the Master's map task and
then on to the physical node where the worker responsible for a given
partition runs. As this sounds like a huge overhead, I inspected the
code further:
I saw that there is also a WorkerGraphPartitioner, and I assume it calls
the partitioning method on its local data (let's say its local HDFS
blocks); if the resulting partition for a vertex is not local, the data
gets moved to the owning worker, which reduces the overhead. Is this
assumption correct?

That is correct: workers forward vertex data to the worker that is
responsible for that vertex, via hash partitioning by default, meaning
that the master is not involved.

2) Let's say the graph is already partitioned in the file system, e.g.
blocks on physical nodes contain logically connected graph nodes. Is it
possible to just read the data as it is and skip the partitioning step?
In that case, I currently assume the vertexID should contain the
partitionID, and the custom partitioning would be an identity function
(instead of hashing or ranges).

In principle you can. You would need to organize the splits so that
each contains all the data for one particular worker, and then assign
the relevant splits to the corresponding worker.

Thanks for your time and help!

Cheers,
Martin

-- 
   Claudio Martella

Re: Graph partitioning and data locality

Posted by Claudio Martella <cl...@gmail.com>.
Hi,

answers are inline.

On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns <ma...@gmx.net>
wrote:

> Hi group,
>
> I have a question concerning the graph partitioning step. If I understood
> the code correctly, the graph is distributed to n partitions using
> vertexID.hashCode() & n. I have two questions concerning that step.
>
> 1) Is the whole graph loaded and partitioned by the Master alone? That
> would mean all the data has to be moved to the Master's map task and then
> on to the physical node where the worker responsible for a given partition
> runs. As this sounds like a huge overhead, I inspected the code further:
> I saw that there is also a WorkerGraphPartitioner, and I assume it calls
> the partitioning method on its local data (let's say its local HDFS
> blocks); if the resulting partition for a vertex is not local, the data
> gets moved to the owning worker, which reduces the overhead. Is this
> assumption correct?
>
>

That is correct: workers forward vertex data to the worker that is
responsible for that vertex, via hash partitioning by default, meaning
that the master is not involved.
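
A schematic of that worker-side flow; every type and method below is a
hypothetical stand-in for Giraph's WorkerGraphPartitioner and
input-superstep machinery, not its real API:

import java.util.List;
import java.util.function.BiConsumer;

public final class WorkerForwardingSketch {
    /**
     * Each worker partitions only the vertices it read from its own
     * input splits and ships just those owned by another worker; for
     * simplicity this assumes one partition per worker.
     */
    static void distribute(List<Long> localVertexIds, int myWorker,
                           int numWorkers, BiConsumer<Integer, Long> send) {
        for (long id : localVertexIds) {
            int owner = Math.abs(Long.hashCode(id)) % numWorkers;
            if (owner != myWorker) {
                send.accept(owner, id);  // cross-worker vertex: one network hop
            }                            // otherwise it stays put: no traffic
        }
    }
}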


>
> 2) Let's say the graph is already partitioned in the file system, e.g.
> blocks on physical nodes contain logically connected graph nodes. Is it
> possible to just read the data as it is and skip the partitioning step? In
> that case, I currently assume the vertexID should contain the partitionID,
> and the custom partitioning would be an identity function (instead of
> hashing or ranges).
>
>

In principle you can. You would need to organize the splits so that
each contains all the data for one particular worker, and then assign
the relevant splits to the corresponding worker.
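
One way to read this suggestion, sketched under the assumption that
split i holds exactly the vertices of partition i; assigning split i to
worker i then means no vertex crosses the network during input. Nothing
below is Giraph API:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class SplitAssignmentSketch {
    /** splitNames.get(i) is assumed to hold the vertices of partition i. */
    static Map<String, Integer> assign(List<String> splitNames, int numWorkers) {
        Map<String, Integer> splitToWorker = new HashMap<>();
        for (int i = 0; i < splitNames.size(); i++) {
            // Identity mapping from partition to worker, wrapping around
            // if there are more partitions than workers.
            splitToWorker.put(splitNames.get(i), i % numWorkers);
        }
        return splitToWorker;
    }
}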


>
> Thanks for your time and help!
>
> Cheers,
> Martin
>



-- 
   Claudio Martella

Re: Graph partitioning and data locality

Posted by Martin Junghanns <ma...@gmx.net>.
Sorry for the typo (no coffee yet): it should be vertexID.hashCode() *%* n

On 04.11.2014 08:36, Martin Junghanns wrote:
> Hi group,
>
> I have a question concerning the graph partitioning step. If I
> understood the code correctly, the graph is distributed to n
> partitions using vertexID.hashCode() & n. I have two questions
> concerning that step.
>
> 1) Is the whole graph loaded and partitioned by the Master alone? That
> would mean all the data has to be moved to the Master's map task and
> then on to the physical node where the worker responsible for a given
> partition runs. As this sounds like a huge overhead, I inspected the
> code further:
> I saw that there is also a WorkerGraphPartitioner, and I assume it
> calls the partitioning method on its local data (let's say its local
> HDFS blocks); if the resulting partition for a vertex is not local,
> the data gets moved to the owning worker, which reduces the overhead.
> Is this assumption correct?
>
> 2) Let's say the graph is already partitioned in the file system, e.g.
> blocks on physical nodes contain logically connected graph nodes. Is it
> possible to just read the data as it is and skip the partitioning
> step? In that case, I currently assume the vertexID should contain
> the partitionID, and the custom partitioning would be an identity
> function (instead of hashing or ranges).
>
> Thanks for your time and help!
>
> Cheers,
> Martin