You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@giraph.apache.org by "G.W." <gw...@gmail.com> on 2015/03/10 04:54:54 UTC

Undirected Vertex Definition and Reflexivity

Hi Giraph Mailing List,

I am writing about an undirected graph I am trying to move to Giraph. I
have a question about the assumption Giraph makes when processing an input.

Let V1 and V2, two vertices connected with a common edge. E1 defines an
edge from V1 to V2. E2 defines an edge from V2 to V1.

Simply put, these are defined in an input file as:
V1, E1
V2, E2

This is working fine, I can process the graph accordingly.

I was trying to see what would happen if I was to simplify the input file:
V1, E1
V2

When would come the time that V2 is processed in a superstep, Giraph would
not suggest E1 as an  outgoing edge. My questions is why, knowing that E1
defines an edge from V1 to V2. The graph being undirected (although there
is no provision for that in my Giraph computation), shouldn't Giraph assume
that V2 is connected to V1?

Down the road, the idea would be to streamline the input format, hence my
question.

Thanks!

Re: Undirected Vertex Definition and Reflexivity

Posted by "G.W." <gw...@gmail.com>.
Hi guys,

Here's a follow-up on this based on everyone's positive feedback.

I was able to load edges and vertices separately, leveraging the -eip (and
-eif) and -vip (and -eip) flags.

Implementation wise, since I may have disconnected vertices, I went ahead
with creating one EdgeReader and one VertexReader. Those readers are
initialised in their respective InputFormat classes.

Once done and tested, I have implemented reversibility,
using IntNullReverseTextEdgeInputFormat as an example, by extending the
edge input format class.

I am now able to load edges and vertices in two files:

vertex_id, property_A, property_B, property_C

for vertices definition

and

vertex_id_source, vertex_id_target

for edges, with implicitly creating a reverse edge for each edge.

Works like a charm!


On 11 March 2015 at 19:23, Matthew Saltz <sa...@gmail.com> wrote:

> Hi,
>
> I believe the answer to your question is yes, though I've never done it.
> If you use only the edge reader, only the vertices in your graph that have
> at least one edge attached to them will be present in your graph. So, if
> you have vertices that are entirely disconnected that you want included,
> you'd need to do both a VertexReader and an Edge Reader (though I've never
> done this). If you don't have disconnected vertices, you don't need the
> VertexReader because Giraph will automatically add all of the vertices in
> your edge file to the graph (I think this can be disabled). Use the -eip
> flag to specify the edge file.
>
> Best,
> Matthew Saltz
>
> On Wed, Mar 11, 2015 at 1:54 AM, G.W. <gw...@gmail.com> wrote:
>
>> Thanks for that!
>>
>> This is the right idea, however I was only using a VertexReader until now
>> – IntNullReverseTextEdgeInputFormat calls for an EdgeReader.
>>
>> I am not sure this is the way it works but I like the idea of segregating
>> edge and vertex definitions.
>>
>> *That leads to the following questions: can Giraph support the use of a
>> VertexReader and EdgeReader at the same time, that is through the -vif and
>> -eif arguments? *
>>
>> If that works, the idea would be to refactor my input files with:
>>
>> Vertices:
>> vertex_id, vertex_type, ...
>>
>> Edges
>> source_id, target_id
>>
>> with the edge reader working in "reverse" mode.
>>
>> Thanks!
>>
>>
>>
>>
>> On 10 March 2015 at 20:02, Matthew Saltz <sa...@gmail.com> wrote:
>>
>>> Have a look at IntNullReverseTextEdgeInputFormat
>>> <https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
>>> It automatically creates reverse edges, but it expects the file format
>>>
>>> <source_id, target_id>
>>>
>>> on each line. If you need to convert it to use longs you can change the
>>> code pretty easily.
>>>
>>> Best,
>>> Matthew
>>>
>>> On Tue, Mar 10, 2015 at 5:37 AM, Young Han <yo...@uwaterloo.ca>
>>> wrote:
>>>
>>>> The input is assumed to be the vertex followed by a set of *directed*
>>>> edges. So, in your example, leaving out E2 means that the final graph will
>>>> not have the directed edge from V2 to V1. To get an undirected edge, you
>>>> need a pair of directed edges.
>>>>
>>>> Internally, Giraph stores the out-edges of each vertex as an adjacency
>>>> list at that vertex. So, for example, your undirected graph becomes a
>>>> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
>>>> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
>>>> list {V2} and a vertex V2 with an empty adjacency list {}.
>>>>
>>>> There's no easy way for Giraph to infer that V2's adjacency list should
>>>> contain V1, because V2 does not track its in-edges. To get around this, you
>>>> can either (1) use an undirected input file with both pairs of edges
>>>> present; (2) have, in your algorithms, all vertices broadcast their ids to
>>>> their out-edge neighbours and then perform mutations to add the missing
>>>> edges in the first two superstep; or (3) modify the code in
>>>> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
>>>> (i.e., add a new "type" of input format). I'm fairly certain that there
>>>> doesn't already exist an "assume undirected graph" input reader, but I'm
>>>> not too familiar with the code paths and options there so I could be wrong.
>>>>
>>>> Young
>>>>
>>>> On Mon, Mar 9, 2015 at 11:54 PM, G.W. <gw...@gmail.com> wrote:
>>>>
>>>>> Hi Giraph Mailing List,
>>>>>
>>>>> I am writing about an undirected graph I am trying to move to Giraph.
>>>>> I have a question about the assumption Giraph makes when processing an
>>>>> input.
>>>>>
>>>>> Let V1 and V2, two vertices connected with a common edge. E1 defines
>>>>> an edge from V1 to V2. E2 defines an edge from V2 to V1.
>>>>>
>>>>> Simply put, these are defined in an input file as:
>>>>> V1, E1
>>>>> V2, E2
>>>>>
>>>>> This is working fine, I can process the graph accordingly.
>>>>>
>>>>> I was trying to see what would happen if I was to simplify the input
>>>>> file:
>>>>> V1, E1
>>>>> V2
>>>>>
>>>>> When would come the time that V2 is processed in a superstep, Giraph
>>>>> would not suggest E1 as an  outgoing edge. My questions is why, knowing
>>>>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>>>>> there is no provision for that in my Giraph computation), shouldn't Giraph
>>>>> assume that V2 is connected to V1?
>>>>>
>>>>> Down the road, the idea would be to streamline the input format, hence
>>>>> my question.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Undirected Vertex Definition and Reflexivity

Posted by Matthew Saltz <sa...@gmail.com>.
Hi,

I believe the answer to your question is yes, though I've never done it. If
you use only the edge reader, only the vertices in your graph that have at
least one edge attached to them will be present in your graph. So, if you
have vertices that are entirely disconnected that you want included, you'd
need to do both a VertexReader and an Edge Reader (though I've never done
this). If you don't have disconnected vertices, you don't need the
VertexReader because Giraph will automatically add all of the vertices in
your edge file to the graph (I think this can be disabled). Use the -eip
flag to specify the edge file.

Best,
Matthew Saltz

On Wed, Mar 11, 2015 at 1:54 AM, G.W. <gw...@gmail.com> wrote:

> Thanks for that!
>
> This is the right idea, however I was only using a VertexReader until now
> – IntNullReverseTextEdgeInputFormat calls for an EdgeReader.
>
> I am not sure this is the way it works but I like the idea of segregating
> edge and vertex definitions.
>
> *That leads to the following questions: can Giraph support the use of a
> VertexReader and EdgeReader at the same time, that is through the -vif and
> -eif arguments? *
>
> If that works, the idea would be to refactor my input files with:
>
> Vertices:
> vertex_id, vertex_type, ...
>
> Edges
> source_id, target_id
>
> with the edge reader working in "reverse" mode.
>
> Thanks!
>
>
>
>
> On 10 March 2015 at 20:02, Matthew Saltz <sa...@gmail.com> wrote:
>
>> Have a look at IntNullReverseTextEdgeInputFormat
>> <https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
>> It automatically creates reverse edges, but it expects the file format
>>
>> <source_id, target_id>
>>
>> on each line. If you need to convert it to use longs you can change the
>> code pretty easily.
>>
>> Best,
>> Matthew
>>
>> On Tue, Mar 10, 2015 at 5:37 AM, Young Han <yo...@uwaterloo.ca>
>> wrote:
>>
>>> The input is assumed to be the vertex followed by a set of *directed*
>>> edges. So, in your example, leaving out E2 means that the final graph will
>>> not have the directed edge from V2 to V1. To get an undirected edge, you
>>> need a pair of directed edges.
>>>
>>> Internally, Giraph stores the out-edges of each vertex as an adjacency
>>> list at that vertex. So, for example, your undirected graph becomes a
>>> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
>>> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
>>> list {V2} and a vertex V2 with an empty adjacency list {}.
>>>
>>> There's no easy way for Giraph to infer that V2's adjacency list should
>>> contain V1, because V2 does not track its in-edges. To get around this, you
>>> can either (1) use an undirected input file with both pairs of edges
>>> present; (2) have, in your algorithms, all vertices broadcast their ids to
>>> their out-edge neighbours and then perform mutations to add the missing
>>> edges in the first two superstep; or (3) modify the code in
>>> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
>>> (i.e., add a new "type" of input format). I'm fairly certain that there
>>> doesn't already exist an "assume undirected graph" input reader, but I'm
>>> not too familiar with the code paths and options there so I could be wrong.
>>>
>>> Young
>>>
>>> On Mon, Mar 9, 2015 at 11:54 PM, G.W. <gw...@gmail.com> wrote:
>>>
>>>> Hi Giraph Mailing List,
>>>>
>>>> I am writing about an undirected graph I am trying to move to Giraph. I
>>>> have a question about the assumption Giraph makes when processing an input.
>>>>
>>>> Let V1 and V2, two vertices connected with a common edge. E1 defines an
>>>> edge from V1 to V2. E2 defines an edge from V2 to V1.
>>>>
>>>> Simply put, these are defined in an input file as:
>>>> V1, E1
>>>> V2, E2
>>>>
>>>> This is working fine, I can process the graph accordingly.
>>>>
>>>> I was trying to see what would happen if I was to simplify the input
>>>> file:
>>>> V1, E1
>>>> V2
>>>>
>>>> When would come the time that V2 is processed in a superstep, Giraph
>>>> would not suggest E1 as an  outgoing edge. My questions is why, knowing
>>>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>>>> there is no provision for that in my Giraph computation), shouldn't Giraph
>>>> assume that V2 is connected to V1?
>>>>
>>>> Down the road, the idea would be to streamline the input format, hence
>>>> my question.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>
>

Re: Undirected Vertex Definition and Reflexivity

Posted by "G.W." <gw...@gmail.com>.
Thanks for that!

This is the right idea, however I was only using a VertexReader until now
– IntNullReverseTextEdgeInputFormat calls for an EdgeReader.

I am not sure this is the way it works but I like the idea of segregating
edge and vertex definitions.

*That leads to the following questions: can Giraph support the use of a
VertexReader and EdgeReader at the same time, that is through the -vif and
-eif arguments? *

If that works, the idea would be to refactor my input files with:

Vertices:
vertex_id, vertex_type, ...

Edges
source_id, target_id

with the edge reader working in "reverse" mode.

Thanks!




On 10 March 2015 at 20:02, Matthew Saltz <sa...@gmail.com> wrote:

> Have a look at IntNullReverseTextEdgeInputFormat
> <https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
> It automatically creates reverse edges, but it expects the file format
>
> <source_id, target_id>
>
> on each line. If you need to convert it to use longs you can change the
> code pretty easily.
>
> Best,
> Matthew
>
> On Tue, Mar 10, 2015 at 5:37 AM, Young Han <yo...@uwaterloo.ca> wrote:
>
>> The input is assumed to be the vertex followed by a set of *directed*
>> edges. So, in your example, leaving out E2 means that the final graph will
>> not have the directed edge from V2 to V1. To get an undirected edge, you
>> need a pair of directed edges.
>>
>> Internally, Giraph stores the out-edges of each vertex as an adjacency
>> list at that vertex. So, for example, your undirected graph becomes a
>> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
>> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
>> list {V2} and a vertex V2 with an empty adjacency list {}.
>>
>> There's no easy way for Giraph to infer that V2's adjacency list should
>> contain V1, because V2 does not track its in-edges. To get around this, you
>> can either (1) use an undirected input file with both pairs of edges
>> present; (2) have, in your algorithms, all vertices broadcast their ids to
>> their out-edge neighbours and then perform mutations to add the missing
>> edges in the first two superstep; or (3) modify the code in
>> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
>> (i.e., add a new "type" of input format). I'm fairly certain that there
>> doesn't already exist an "assume undirected graph" input reader, but I'm
>> not too familiar with the code paths and options there so I could be wrong.
>>
>> Young
>>
>> On Mon, Mar 9, 2015 at 11:54 PM, G.W. <gw...@gmail.com> wrote:
>>
>>> Hi Giraph Mailing List,
>>>
>>> I am writing about an undirected graph I am trying to move to Giraph. I
>>> have a question about the assumption Giraph makes when processing an input.
>>>
>>> Let V1 and V2, two vertices connected with a common edge. E1 defines an
>>> edge from V1 to V2. E2 defines an edge from V2 to V1.
>>>
>>> Simply put, these are defined in an input file as:
>>> V1, E1
>>> V2, E2
>>>
>>> This is working fine, I can process the graph accordingly.
>>>
>>> I was trying to see what would happen if I was to simplify the input
>>> file:
>>> V1, E1
>>> V2
>>>
>>> When would come the time that V2 is processed in a superstep, Giraph
>>> would not suggest E1 as an  outgoing edge. My questions is why, knowing
>>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>>> there is no provision for that in my Giraph computation), shouldn't Giraph
>>> assume that V2 is connected to V1?
>>>
>>> Down the road, the idea would be to streamline the input format, hence
>>> my question.
>>>
>>> Thanks!
>>>
>>>
>>>
>>
>

Re: Undirected Vertex Definition and Reflexivity

Posted by Matthew Saltz <sa...@gmail.com>.
Have a look at IntNullReverseTextEdgeInputFormat
<https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/IntNullReverseTextEdgeInputFormat.html>.
It automatically creates reverse edges, but it expects the file format

<source_id, target_id>

on each line. If you need to convert it to use longs you can change the
code pretty easily.

Best,
Matthew

On Tue, Mar 10, 2015 at 5:37 AM, Young Han <yo...@uwaterloo.ca> wrote:

> The input is assumed to be the vertex followed by a set of *directed*
> edges. So, in your example, leaving out E2 means that the final graph will
> not have the directed edge from V2 to V1. To get an undirected edge, you
> need a pair of directed edges.
>
> Internally, Giraph stores the out-edges of each vertex as an adjacency
> list at that vertex. So, for example, your undirected graph becomes a
> vertex object V1 with an adjacency list {V2} and a vertex object V2 with an
> adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
> list {V2} and a vertex V2 with an empty adjacency list {}.
>
> There's no easy way for Giraph to infer that V2's adjacency list should
> contain V1, because V2 does not track its in-edges. To get around this, you
> can either (1) use an undirected input file with both pairs of edges
> present; (2) have, in your algorithms, all vertices broadcast their ids to
> their out-edge neighbours and then perform mutations to add the missing
> edges in the first two superstep; or (3) modify the code in
> org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
> (i.e., add a new "type" of input format). I'm fairly certain that there
> doesn't already exist an "assume undirected graph" input reader, but I'm
> not too familiar with the code paths and options there so I could be wrong.
>
> Young
>
> On Mon, Mar 9, 2015 at 11:54 PM, G.W. <gw...@gmail.com> wrote:
>
>> Hi Giraph Mailing List,
>>
>> I am writing about an undirected graph I am trying to move to Giraph. I
>> have a question about the assumption Giraph makes when processing an input.
>>
>> Let V1 and V2, two vertices connected with a common edge. E1 defines an
>> edge from V1 to V2. E2 defines an edge from V2 to V1.
>>
>> Simply put, these are defined in an input file as:
>> V1, E1
>> V2, E2
>>
>> This is working fine, I can process the graph accordingly.
>>
>> I was trying to see what would happen if I was to simplify the input
>> file:
>> V1, E1
>> V2
>>
>> When would come the time that V2 is processed in a superstep, Giraph
>> would not suggest E1 as an  outgoing edge. My questions is why, knowing
>> that E1 defines an edge from V1 to V2. The graph being undirected (although
>> there is no provision for that in my Giraph computation), shouldn't Giraph
>> assume that V2 is connected to V1?
>>
>> Down the road, the idea would be to streamline the input format, hence my
>> question.
>>
>> Thanks!
>>
>>
>>
>

Re: Undirected Vertex Definition and Reflexivity

Posted by Young Han <yo...@uwaterloo.ca>.
The input is assumed to be the vertex followed by a set of *directed*
edges. So, in your example, leaving out E2 means that the final graph will
not have the directed edge from V2 to V1. To get an undirected edge, you
need a pair of directed edges.

Internally, Giraph stores the out-edges of each vertex as an adjacency list
at that vertex. So, for example, your undirected graph becomes a vertex
object V1 with an adjacency list {V2} and a vertex object V2 with an
adjacency list {V1}. The directed graph would be a vertex V1 with adjacency
list {V2} and a vertex V2 with an empty adjacency list {}.

There's no easy way for Giraph to infer that V2's adjacency list should
contain V1, because V2 does not track its in-edges. To get around this, you
can either (1) use an undirected input file with both pairs of edges
present; (2) have, in your algorithms, all vertices broadcast their ids to
their out-edge neighbours and then perform mutations to add the missing
edges in the first two superstep; or (3) modify the code in
org.apache.giraph.io.* (in giraph-core) to cache and add missing edges
(i.e., add a new "type" of input format). I'm fairly certain that there
doesn't already exist an "assume undirected graph" input reader, but I'm
not too familiar with the code paths and options there so I could be wrong.

Young

On Mon, Mar 9, 2015 at 11:54 PM, G.W. <gw...@gmail.com> wrote:

> Hi Giraph Mailing List,
>
> I am writing about an undirected graph I am trying to move to Giraph. I
> have a question about the assumption Giraph makes when processing an input.
>
> Let V1 and V2, two vertices connected with a common edge. E1 defines an
> edge from V1 to V2. E2 defines an edge from V2 to V1.
>
> Simply put, these are defined in an input file as:
> V1, E1
> V2, E2
>
> This is working fine, I can process the graph accordingly.
>
> I was trying to see what would happen if I was to simplify the input file:
> V1, E1
> V2
>
> When would come the time that V2 is processed in a superstep, Giraph would
> not suggest E1 as an  outgoing edge. My questions is why, knowing that E1
> defines an edge from V1 to V2. The graph being undirected (although there
> is no provision for that in my Giraph computation), shouldn't Giraph assume
> that V2 is connected to V1?
>
> Down the road, the idea would be to streamline the input format, hence my
> question.
>
> Thanks!
>
>
>