Posted to user@giraph.apache.org by Peter Morgan <pm...@gmail.com> on 2013/01/20 15:11:31 UTC

Differences with Edge and Vertex Input Format

I'm interested in hearing about the differences in loading using the edge
and vertex inputs. In particular, I have a few questions:

1) How can vertex state be set using edge input format?
2) How are vertices with only in-edges initialised using edge input format?
3) Is either vertex or edge input more efficient for loading? I guess less
needs to be shuffled around the network using vertex input?
4) If your adjacency list for vertex input is pre-partitioned, does this
decrease loading time as again vertices don't need to be shuffled across
the network?

Thanks in advance for any help.
Peter

Re: Differences with Edge and Vertex Input Format

Posted by Eli Reisman <ap...@gmail.com>.

Giraph doesn't enforce which splits are read by which workers on which
compute node. But if you use locality (I can't remember whether it is on
or off by default) and the underlying Hadoop framework holds useful
locality data (this seems to vary with cluster setup), each worker and all
of its input-split reading threads will first try to claim and read splits
local to their compute node, and only then fall back to reading non-local
splits over the network. This assumes, of course, that the compute nodes
double as datanodes for the cluster's HDFS.
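
To make the local-first behaviour concrete, here is a minimal sketch in
plain Java. It is not Giraph code: the Split class, the claimOrder method
and the host names are all made up for illustration, and it only models
the order in which one worker would try splits, not the actual claiming
protocol between workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only, not Giraph code: tries splits stored on the worker's own host first. */
final class LocalFirstSplitOrder {

    /** Hypothetical stand-in for an input split plus the hosts that store its data. */
    static final class Split {
        final String name;
        final List<String> hosts;

        Split(String name, List<String> hosts) {
            this.name = name;
            this.hosts = hosts;
        }
    }

    /** Returns all splits, local ones first, keeping the original order within each group. */
    static List<Split> claimOrder(List<Split> splits, String workerHost) {
        List<Split> local = new ArrayList<>();
        List<Split> remote = new ArrayList<>();
        for (Split s : splits) {
            if (s.hosts.contains(workerHost)) {
                local.add(s);   // data is on this node, no network read needed
            } else {
                remote.add(s);  // would have to be read over the network
            }
        }
        local.addAll(remote);
        return local;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("split-0", Arrays.asList("node-a", "node-b")),
            new Split("split-1", Arrays.asList("node-c")),
            new Split("split-2", Arrays.asList("node-c", "node-a")));
        // Order in which a worker on node-c would try the splits.
        for (Split s : claimOrder(splits, "node-c")) {
            System.out.println(s.name);
        }
    }
}
```

For a worker on node-c this prints split-1 and split-2 before split-0.
If no split lists the worker's host (for example because the compute nodes
are not datanodes), the order is simply the original one.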

On Sun, Jan 20, 2013 at 8:42 AM, Alessandro Presta <al...@fb.com>wrote:

> Hi Peter,
>
> Good questions.
>
> 1) If you only specify an EdgeInputFormat, vertex values will be
> initialized to their type's default value. You can also specify a
> VertexValueInputFormat, which is just a more convenient API around
> VertexInputFormat to read vertex values.
>
> 2) They will be created as they receive the first message, unless you
> override VertexResolver with some other behavior.
>
> 3) In general vertex input is more efficient because of what you said, and
> because it's a more compact representation. However, if your original
> dataset is in the form of a list of edges, the additional step of grouping
> them by source vertex might be more expensive than doing that in Giraph
> (depending on your infrastructure).
>
> 4) We don't have a way to enforce which worker will read what splits, so I
> think in general you can expect most of the data to be shuffled across
> workers.
>
> Alessandro

Re: Differences with Edge and Vertex Input Format

Posted by Alessandro Presta <al...@fb.com>.

Hi Peter,

Good questions.

1) If you only specify an EdgeInputFormat, vertex values will be initialized to their type's default value. You can also specify a VertexValueInputFormat, which is just a more convenient API around VertexInputFormat to read vertex values.

2) They will be created as they receive the first message, unless you override VertexResolver with some other behavior.

3) In general vertex input is more efficient because of what you said, and because it's a more compact representation. However, if your original dataset is in the form of a list of edges, the additional step of grouping them by source vertex might be more expensive than doing that in Giraph (depending on your infrastructure).

4) We don't have a way to enforce which worker will read what splits, so I think in general you can expect most of the data to be shuffled across workers.

Alessandro
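
As an illustration of points 2 and 3, here is a small sketch in plain Java
(no Giraph classes involved) of the "grouping by source vertex" step that
turns an edge list into the adjacency lists a vertex input would read. The
class and method names are made up for this example, and a real dataset
would need a distributed job rather than an in-memory map. The second
method lists the vertices that appear only as edge targets, which are the
ones point 2 is about: they get no adjacency-list entry of their own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only, not Giraph code: edge list to adjacency lists, in memory. */
final class EdgeListToAdjacency {

    /** Groups {source, target} pairs by source: one adjacency list per source vertex. */
    static Map<Long, List<Long>> groupBySource(List<long[]> edges) {
        Map<Long, List<Long>> adjacency = new TreeMap<>();
        for (long[] e : edges) {
            adjacency.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        return adjacency;
    }

    /** Vertices that occur only as targets: they have in-edges but no entry of their own. */
    static Set<Long> targetsWithoutEntry(Map<Long, List<Long>> adjacency) {
        Set<Long> result = new TreeSet<>();
        for (List<Long> targets : adjacency.values()) {
            for (Long t : targets) {
                if (!adjacency.containsKey(t)) {
                    result.add(t);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> edges = Arrays.asList(
            new long[] {1, 2}, new long[] {1, 3}, new long[] {2, 3}, new long[] {4, 1});
        Map<Long, List<Long>> adjacency = groupBySource(edges);
        System.out.println(adjacency);                       // {1=[2, 3], 2=[3], 4=[1]}
        System.out.println(targetsWithoutEntry(adjacency));  // [3]
    }
}
```

Here vertex 3 has two in-edges and no out-edges, so it never shows up as a
source. The edge list repeats the source id on every edge, while the
grouped form stores it once per vertex, which is the sense in which the
vertex representation is more compact.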
