Posted to user@giraph.apache.org by Eric Kimbrel <le...@gmail.com> on 2014/01/29 18:08:46 UTC

duplicate edges created with TextVertexInputFormat

I am reading in an adjacency list using an input format that extends TextVertexInputFormat.  My code doesn’t do anything to address input splits; it leaves that to the underlying Giraph implementation.  However, it appears that as the data is read, two identical input splits are created and read in, so the edges for each vertex are created twice.

My input format is a simple adjacency list, where each node is represented by a single line of text which lists the node id, and all of its neighbors.
I read the edges into an edge list and then create the vertex via:
Vertex<Text, LouvainNodeState, LongWritable> vertex = this.getConf().createVertex();
vertex.initialize(id, state, edgesList);
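
For context, the parsing step that feeds edgesList could look roughly like the sketch below. It is plain Java with no Giraph types, and it is not the actual LouvainVertexInputFormat code, which is not shown in this thread. The class name AdjacencyLine and the whitespace delimiter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Giraph: holds one parsed adjacency-list line.
final class AdjacencyLine {
    final String id;
    final List<String> neighbors;

    private AdjacencyLine(String id, List<String> neighbors) {
        this.id = id;
        this.neighbors = neighbors;
    }

    // Assumes whitespace-separated tokens: "<nodeId> <neighbor> <neighbor> ..."
    static AdjacencyLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("empty adjacency line");
        }
        List<String> neighbors = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            neighbors.add(tokens[i]);
        }
        return new AdjacencyLine(tokens[0], neighbors);
    }
}
```

Each neighbor would then become one edge in the list passed to vertex.initialize. A parser like this produces each edge once per line, so doubled edges point to the line being read twice rather than to the parsing itself.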


Logs below show the edges being read in twice (as part of two different input splits in the input stage) and then being represented twice per node in the computation phase.
This example is using 1 compute thread and 1 worker.

If I am creating the vertex incorrectly or doing something else wrong please let me know.  Thanks.



Log snippet of vertex input process.

14/01/28 11:02:41 INFO worker.BspServiceWorker: loadInputSplits: Using 1 thread(s), originally 1 threads(s) for 2 total splits.
14/01/28 11:02:41 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved input split path /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0, overall roughly 0.0% input splits reserved
14/01/28 11:02:41 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1

… other nodes processed

14/01/28 11:02:42 INFO worker.InputSplitsCallable: loadFromInputSplit: Finished loading /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 (v=9, e=34)
14/01/28 11:02:42 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved input split path /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1, overall roughly 50.0% input splits reserved
14/01/28 11:02:42 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1

… other nodes processed again


Logs from the compute phase show that the edges really are added twice (each line below is formatted as EDGE #: target:weight).
Each node should have only one edge to each neighbor, but instead it has two.

14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: NODE:  1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 1: 2:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 2: 3:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 3: 4:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 4: 5:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 5: 6:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 6: 2:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 7: 3:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 8: 4:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 9: 5:1
14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 10: 6:1
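
A quick way to confirm this kind of duplication without reading the logs by eye is to count how often each target id occurs in a vertex's edge list. The sketch below does that over plain strings. EdgeDuplicateCheck is a hypothetical helper name, and the targets are assumed to have already been pulled out of the vertex's edges.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Giraph.
final class EdgeDuplicateCheck {
    // Returns each target id that occurs more than once, with its count,
    // in order of first appearance.
    static Map<String, Integer> duplicateTargets(List<String> targets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : targets) {
            counts.merge(target, 1, Integer::sum);
        }
        // Keep only the targets seen two or more times.
        counts.values().removeIf(count -> count < 2);
        return counts;
    }
}
```

For the node shown above, passing the ten targets 2, 3, 4, 5, 6, 2, 3, 4, 5, 6 would report each of the five targets with a count of 2.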



Re: duplicate edges created with TextVertexInputFormat

Posted by Rob Vesse <rv...@dotnetrdf.org>.
The logs appear to show that you get two identical input splits:

14/01/28 11:02:41 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'

14/01/28 11:02:42 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'

Have you by any chance accidentally passed in the input file twice?
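
If the job logs are long, a check like this can be automated. The sketch below is a minimal plain-Java scan, not a Giraph utility. It assumes the log lines keep the "got input split '...'" wording shown above, and SplitLogCheck is a hypothetical name.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log-scanning helper, not part of Giraph.
final class SplitLogCheck {
    // Assumes the wording seen in the worker logs: got input split '<path>:<start>+<length>'
    private static final Pattern SPLIT = Pattern.compile("got input split '([^']+)'");

    // Returns every split descriptor that is reported more than once.
    static Set<String> repeatedSplits(List<String> logLines) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = SPLIT.matcher(line);
            if (m.find() && !seen.add(m.group(1))) {
                repeated.add(m.group(1));
            }
        }
        return repeated;
    }
}
```

A non-empty result means the same file range was handed out as more than one split, which is what you would expect to see if the same input path had been added to the job twice.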

Rob

From:  Eric Kimbrel <le...@gmail.com>
Reply-To:  <us...@giraph.apache.org>
Date:  Wednesday, 29 January 2014 09:08
To:  <us...@giraph.apache.org>
Subject:  duplicate edges created with TextVertexInputFormat

> I am reading in an adjacency list using an input format which extends
> TextVertexInputFormat.  My code doesn't do anything to address input splits,
> but leaves that to the underlying giraph implementation.  However it appears
> that as the data is being read 2 identical input splits are created and read
> in, resulting in edges for each vertex being created twice.
> 
> My input format is a simple adjacency list, where each node is represented by
> a single line of text which lists the node id, and all of its neighbors.
> I read the edges into an edge list and then create the vertex via:
>> Vertex<Text, LouvainNodeState, LongWritable> vertex =
>> this.getConf().createVertex();
>> vertex.initialize(id, state, edgesList);
> 
> 
> Logs below show the edges being read in twice (as part of two different input
> splits in the input stage) and then being represented twice per node in the
> computation phase.
> This example is using 1 compute thread and 1 worker.
> 
> If I am creating the vertex incorrectly or doing something else wrong please
> let me know.  Thanks.
> 
> 
> 
> Log snippet of vertex input process.
> 
> 14/01/28 11:02:41 INFO worker.BspServiceWorker: loadInputSplits: Using 1
> thread(s), originally 1 threads(s) for 2 total splits.
> 14/01/28 11:02:41 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved
> input split path 
> /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0,
> overall roughly 0.0% input splits reserved
> 14/01/28 11:02:41 INFO worker.InputSplitsCallable: getInputSplit: Reserved
> /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0
> from ZooKeeper and got input split
> 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/smal
> l:0+172'
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1
> 
> … other nodes processed
> 
> 14/01/28 11:02:42 INFO worker.InputSplitsCallable: loadFromInputSplit:
> Finished loading 
> /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0
> (v=9, e=34)
> 14/01/28 11:02:42 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved
> input split path 
> /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1,
> overall roughly 50.0% input splits reserved
> 14/01/28 11:02:42 INFO worker.InputSplitsCallable: getInputSplit: Reserved
> /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1
> from ZooKeeper and got input split
> 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/smal
> l:0+172'
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1
> 
> … other nodes processed again
> 
> 
> Logs from the compute phase show that edges really are added twice  (format
> below shows edge #:target:weight)
> While each node should only have one edge to each other, it instead has two.
> 
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: NODE:  1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 1: 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 2: 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 3: 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 4: 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 5: 6:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 6: 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 7: 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 8: 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 9: 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 10: 6:1
> 
>