You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@giraph.apache.org by Jan van der Lugt <ja...@gmail.com> on 2012/04/24 17:53:42 UTC

Missing features in Giraph

Dear all,

After having worked with Giraph for some weeks I feel like there are two
features 'missing' in Giraph. It may be I simply missed them in the
Javadoc, since the documentation is a work in progress at this point. In
another Google Pregel-clone, Stanford GPS, it is possible to define a
global object map, which can be used by all workers to share data, like the
current phase in the algorithm. I have not been able to find such a feature
in Giraph. Of course it would be possible to (ab)use aggregators for this,
but I doubt this is the easiest or most efficient approach. Furthermore, it
would be very helpful if there would be one special vertex that has the
role of a master. This should not have to correspond to an existing vertex
in the graph, it would be easier if it were not, actually. This master node
would then be able to perform some centralized steps in the algorithm, of
which the output can then be shared with other workers via the global
object map. The master node could have the same interface as the workers
(compute(), getAggregator(), getConf(), etc.). Again, it would be possible
to solve this otherwise, for example in the VertexReader, but this would
make code less elegant and would require picking a vertex id that does not
exist in the graph, which is difficult if the input is not known in advance.

I realize I am biased because my earlier experiences with Stanford GPS, but
I feel these features will not be very difficult to implement or would add
bulkiness to the API. They can make the implementation of many graph
algorithms easier, though, because many of these algorithms have some
notion of a centralized master node. During the next 5 months I will be
working with Giraph for my Master's project, so I would be more than
willing to help out implementing these features, ideally after receiving
some pointers from more experienced Giraph developers.

Regards,
Jan van der Lugt

Re: Missing features in Giraph

Posted by Semih Salihoglu <se...@stanford.edu>.
Hi All,

Just wanted to add a note here. I'm the student who's working on GPS at
Stanford and I've been talking to Avery from time to time about some of the
features in GPS that we could also implement in Giraph. About the
GlobalObjectsMap: This is meant to be GPS' implementation of Aggregators
actually. The way they're accessed and modified by the vertices might be
easier. Otherwise Aggregators and GlobalObjects are equivalent. But what
Jan is referring to maybe that it is easier/more explicit to
clear/update/insert this map in GPS than in Giraph.

For the master vertex: I'm very embarrassed to say this but I have this
Jira https://issues.apache.org/jira/browse/GIRAPH-127, to implement this in
Giraph. Avery and I talked about this and some other person back in the
time also commented on it. Unfortunately, I haven't gotten to it yet.
Master.compute(), is a nice optimization useful for the following case: If
you have an algorithm that consists of 2 or more vertex-centric parts, such
as we first find connected components and then partition each component,
then usually in between such vertex-centric computations, there's a global
computation, that is not vertex-centric. I give the following k-means-like
algorithm as an example: 1) pick k cluster centers, 2) assign each vertex
to the closest cluster, 3) count the number of edges crossing clusters, 4)
if the clustering is good enough stop, otherwise go back to 1. For example,
in this algorithm picking k cluster centers is not vertex-centric. One
special vertex has to pick k cluster centers and assign it to an aggregator
or (GlobalObject in GPS). Master.compute() is intended to be the right
abstraction for such global computations and to coordinate multiple
vertex-centric components of an algorithm.  It also eliminates iterations
in which a special vertex does some global computation and every other
vertex is idle. Such iterations are usually very short but still wasteful
to have all machines be idle for the short amount of time. In short,
master.compute() makes it easier to write algorithms that consists of
multiple vertex-centric parts.

That said, I haven't gotten to implementing Master.compute() in Giraph yet.
Starting from July looks like a promising time for me to work on it, if
somebody does not want to do it earlier.

Thank you,

semih
On Tue, Apr 24, 2012 at 9:36 AM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> Hi Jan,
>
> I hear you.
> One thing that is not clear to me is whether the global object map and
> the master vertex can actually be the same thing (a centralizer). A
> rationalization of these two objects could actually get us to have
> only one added feature instead of two similar ones.
> The one thing that doesn't convince me about the centralization of
> computation would be the risk of creating a hot-spot, a computational
> bottleneck that would just create a big serial computation path
> before/after the parallel one.
> My claim is that if the same thing can be done without a
> centralization, it should be done that way because of obvious
> performance issues. Centralization helps user's life, but it's often
> the root for bottlenecks.
> I guess that a centralizer could be defined hierarchically, like
> aggregators, so once per worker and on the master by composing the
> output of the workers'. This computation should be commutative and
> associative, I guess I'm biased by the definition of aggregators (but
> this requirement is quite obvious and therefore recurrent in
> distributed systems for many reasons).
>
> My claim is that i'd think twice before adding features because they
> make user's life easier at first sight. For example, such a feature
> would have been useful in mapreduce as well, but the focus on having a
> simple and quite restrictive paradigm (like no communication between
> mappers, which is something that everybody would want) *should* win on
> the long term. These are only general concerns on the topic, it would
> be very useful to discuss a more defined set of modifications to the
> current model so that we could actually be more precise in the
> evaluation.
>
> Would you like working on such a definition?
>
> On Tue, Apr 24, 2012 at 5:53 PM, Jan van der Lugt <ja...@gmail.com>
> wrote:
> > Dear all,
> >
> > After having worked with Giraph for some weeks I feel like there are two
> > features 'missing' in Giraph. It may be I simply missed them in the
> Javadoc,
> > since the documentation is a work in progress at this point. In another
> > Google Pregel-clone, Stanford GPS, it is possible to define a global
> object
> > map, which can be used by all workers to share data, like the current
> phase
> > in the algorithm. I have not been able to find such a feature in Giraph.
> Of
> > course it would be possible to (ab)use aggregators for this, but I doubt
> > this is the easiest or most efficient approach. Furthermore, it would be
> > very helpful if there would be one special vertex that has the role of a
> > master. This should not have to correspond to an existing vertex in the
> > graph, it would be easier if it were not, actually. This master node
> would
> > then be able to perform some centralized steps in the algorithm, of which
> > the output can then be shared with other workers via the global object
> map.
> > The master node could have the same interface as the workers (compute(),
> > getAggregator(), getConf(), etc.). Again, it would be possible to solve
> this
> > otherwise, for example in the VertexReader, but this would make code less
> > elegant and would require picking a vertex id that does not exist in the
> > graph, which is difficult if the input is not known in advance.
> >
> > I realize I am biased because my earlier experiences with Stanford GPS,
> but
> > I feel these features will not be very difficult to implement or would
> add
> > bulkiness to the API. They can make the implementation of many graph
> > algorithms easier, though, because many of these algorithms have some
> notion
> > of a centralized master node. During the next 5 months I will be working
> > with Giraph for my Master's project, so I would be more than willing to
> help
> > out implementing these features, ideally after receiving some pointers
> from
> > more experienced Giraph developers.
> >
> > Regards,
> > Jan van der Lugt
>
>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>

Re: Missing features in Giraph

Posted by Claudio Martella <cl...@gmail.com>.
Hi Jan,

I hear you.
One thing that is not clear to me is whether the global object map and
the master vertex can actually be the same thing (a centralizer). A
rationalization of these two objects could actually get us to have
only one added feature instead of two similar ones.
The one thing that doesn't convince me about the centralization of
computation would be the risk of creating a hot-spot, a computational
bottleneck that would just create a big serial computation path
before/after the parallel one.
My claim is that if the same thing can be done without a
centralization, it should be done that way because of obvious
performance issues. Centralization helps user's life, but it's often
the root for bottlenecks.
I guess that a centralizer could be defined hierarchically, like
aggregators, so once per worker and on the master by composing the
output of the workers'. This computation should be commutative and
associative, I guess I'm biased by the definition of aggregators (but
this requirement is quite obvious and therefore recurrent in
distributed systems for many reasons).

My claim is that i'd think twice before adding features because they
make user's life easier at first sight. For example, such a feature
would have been useful in mapreduce as well, but the focus on having a
simple and quite restrictive paradigm (like no communication between
mappers, which is something that everybody would want) *should* win on
the long term. These are only general concerns on the topic, it would
be very useful to discuss a more defined set of modifications to the
current model so that we could actually be more precise in the
evaluation.

Would you like working on such a definition?

On Tue, Apr 24, 2012 at 5:53 PM, Jan van der Lugt <ja...@gmail.com> wrote:
> Dear all,
>
> After having worked with Giraph for some weeks I feel like there are two
> features 'missing' in Giraph. It may be I simply missed them in the Javadoc,
> since the documentation is a work in progress at this point. In another
> Google Pregel-clone, Stanford GPS, it is possible to define a global object
> map, which can be used by all workers to share data, like the current phase
> in the algorithm. I have not been able to find such a feature in Giraph. Of
> course it would be possible to (ab)use aggregators for this, but I doubt
> this is the easiest or most efficient approach. Furthermore, it would be
> very helpful if there would be one special vertex that has the role of a
> master. This should not have to correspond to an existing vertex in the
> graph, it would be easier if it were not, actually. This master node would
> then be able to perform some centralized steps in the algorithm, of which
> the output can then be shared with other workers via the global object map.
> The master node could have the same interface as the workers (compute(),
> getAggregator(), getConf(), etc.). Again, it would be possible to solve this
> otherwise, for example in the VertexReader, but this would make code less
> elegant and would require picking a vertex id that does not exist in the
> graph, which is difficult if the input is not known in advance.
>
> I realize I am biased because my earlier experiences with Stanford GPS, but
> I feel these features will not be very difficult to implement or would add
> bulkiness to the API. They can make the implementation of many graph
> algorithms easier, though, because many of these algorithms have some notion
> of a centralized master node. During the next 5 months I will be working
> with Giraph for my Master's project, so I would be more than willing to help
> out implementing these features, ideally after receiving some pointers from
> more experienced Giraph developers.
>
> Regards,
> Jan van der Lugt



-- 
   Claudio Martella
   claudio.martella@gmail.com