You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@giraph.apache.org by "Semih Salihoglu (Created) (JIRA)" <ji...@apache.org> on 2012/01/19 21:52:40 UTC

[jira] [Created] (GIRAPH-127) Extending the API with a master.compute() function.

Extending the API with a master.compute() function.
---------------------------------------------------

                 Key: GIRAPH-127
                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
             Project: Giraph
          Issue Type: New Feature
          Components: bsp, examples, graph
            Reporter: Semih Salihoglu


First of all, sorry for the long explanation to this feature.

I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
 * Input G(V, E), k, numEdgesThreshold, maxIterations
 * Algorithm:
 * int numEdgesCrossingClusters = Integer.MAX_INT;
*  int iterationNo = 0;
 * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
 *    iterationNo++;
 *    int[] clusterCenters = pickKClusterCenters(k, G);
 *    findClusterCenters(G, clusterCenters);
 *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
 * }
The algorithm goes through the following steps in iterations:
1) Pick k random initial cluster centers
2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
3) Count the nuimber of edges crossing clusters
4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.

In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.

The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.

A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.

I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).

If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Re: [jira] [Created] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by Semih Salihoglu <se...@stanford.edu>.
I'm putting the answer in line.

On Thu, Jan 19, 2012 at 6:22 PM, Avery Ching <ac...@apache.org> wrote:

> Not sure if Semih is on the giraph-dev list.  Forwarding the question to
> him.
>
> Avery
>
> P.S.  Interesting idea if I understand correctly, attaching the compute
> functionality to an aggregator that the master will run between supersteps?
>
> On 1/19/12 1:20 PM, Claudio Martella wrote:
>
>> Hi Semih,
>>
>> interesting email. I'm probably not getting your technique right, but
>> why wouldn't it be possible to compute the master.compute() inside of
>> an aggregator?
>>
>> Not only it *should* be possible, but as aggregators are computed both
>> on workers AND on the master, you should have a faster computation.
>> for instance you could aggregate the number of cut edges on each
>> worker and aggregate the total number on the master. Same could happen
>> for choosing the centroids.
>>
> This is exactly how you would count the number of edges, I'm not
suggesting something else in my example. Actually if you look at the GPS
code for Kmeans from the link I sent, that's exactly how it's done.
Master.compute() is meant for something else. One thing it's meant for is
expressing sequential parts of an algorithm. Consider the checking of the
condition inside the while() in the pseudocode from the simple k-means
algorithm:
 while ((numEdgesCrossingCluster>  numEdgesThreshold)&&  iterationNo<
 maxIterations) { ... }

In order to do this simple condition check, the system somehow has to
understand that a) it's time to do that check, b) have access to the
aggregator that was used to count "numEdgesCrossingClusters" c) if that
condition fails, the computation should stop (to keep my email shorter, i'm
going to skip this last thing). An aggregator is only a global object,
which can be updated by vertices and the programmer defines only how
aggregation on that object should be done. By itself, it can't know when to
do that condition check. So in order to run this simple line of code, a lot
of global information needs to be available. As Avery pointed out in his
response, no matter where the programmer wants to express this condition
check, it has to have access to other aggregators. It also has to have
access to non-aggregator global data: in this example, iterationNo is a
global data but it should not be aggregated as it's not useful information
for the vertices.

As for picking the k random cluster centers, this cannot be done in an
aggregator either. I agree that after the cluster centers are picked, they
have to be put inside an aggregator, so that it's available to the workers.
An aggregator is more the storage location of the centers, not the location
where it's picked. The action of picking of these vertices needs to be done
in exactly one place, not per worker or per vertex. As I try to explain in
the previous email, the picking can be done in one of the preSuperstep()
methods in a special worker, but this would waste an entire superstep. In
general master.compute() would be the place where any kind of global,
non-vertex-centric computation is expressed, and it would be where the
programmer stores global information that is not used by the vertices and
hence should not be aggregated. I think the exact purpose might be clearer
in the actual GPS example for k-means than my explanation.



>> On Thu, Jan 19, 2012 at 9:52 PM, Semih Salihoglu (Created) (JIRA)
>> <ji...@apache.org>  wrote:
>>
>>> Extending the API with a master.compute() function.
>>> ------------------------------**---------------------
>>>
>>>                 Key: GIRAPH-127
>>>                 URL: https://issues.apache.org/**jira/browse/GIRAPH-127<https://issues.apache.org/jira/browse/GIRAPH-127>
>>>             Project: Giraph
>>>          Issue Type: New Feature
>>>          Components: bsp, examples, graph
>>>            Reporter: Semih Salihoglu
>>>
>>>
>>> First of all, sorry for the long explanation to this feature.
>>>
>>> I want to expand the API of Giraph with a new function called
>>> master.compute(), that would get called at the master before each superstep
>>> and I will try to explain the purpose that it would serve with an example.
>>> Let's say we want to implement the following simplified version of the
>>> k-means clustering algorithm. Pseudocode below:
>>>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>>>  * Algorithm:
>>>  * int numEdgesCrossingClusters = Integer.MAX_INT;
>>> *  int iterationNo = 0;
>>>  * while ((numEdgesCrossingCluster>  numEdgesThreshold)&&  iterationNo<
>>>  maxIterations) {
>>>  *    iterationNo++;
>>>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>>>  *    findClusterCenters(G, clusterCenters);
>>>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters(**);
>>>  * }
>>> The algorithm goes through the following steps in iterations:
>>> 1) Pick k random initial cluster centers
>>> 2) Assign each vertex to the cluster center that it's closest to (in
>>> Giraph, this can be implemented in message passing similar to how
>>> ShortestPaths is implemented):
>>> 3) Count the nuimber of edges crossing clusters
>>> 4) Go back to step 1, if there are a lot of edges crossing clusters and
>>> we haven't exceeded maximum number of iterations yet.
>>>
>>> In an algorithm like this, step 2 and 3 are where most of the work
>>> happens and both parts have very neat message-passing implementations. I'll
>>> try to give an overview without going into the details. Let's say we define
>>> a Vertex in Giraph to hold a custom Writable object that holds 2 integer
>>> values and sends a message with upto 2 integer values.
>>> Step 2 is very similar to ShortestPaths algorithm and has two stages: In
>>> the first stage, each vertex checks to see whether or not it's one of the
>>> cluster centers. If so, it assigns itself the value (id, 0), otherwise it
>>> assigns itself (Null, Null). In the 2nd stage, the vertices assign
>>> themselves to the minimum distance cluster center by looking at their
>>> neighbors (cluster centers, distance) values (received as 2 integer
>>> messages) and their current values, and changing their values if they find
>>> a lower distance cluster center. This happens in x number of supersteps
>>> until every vertex converges.
>>> Step 3, counting the number of edges crossing clusters, is also very
>>> easy to implement in Giraph. Once each vertex has a cluster center, the
>>> number of edges crossing clusters can be counted by an aggregator, let's
>>> say called "num-edges-crossing". It would again have two stages: First
>>> stage, every vertex just sends its cluster id to all its neighbors. Second
>>> stage, every vertex looks at their neighbors' cluster ids in the messages,
>>> and for each cluster id that is not equal to its own cluster id, it
>>> increments "num-edges-crossing" by 1.
>>>
>>> The other 2 steps, step 1 and 4, are very simple sequential
>>> computations. Step 1 just picks k random vertex ids and puts it into an
>>> aggregator. Step 4 just compares "num-edges-crossing" by a threshold and
>>> also checks whether or not the algorithm has exceeded maxIterations (not
>>> supersteps but iterations of going through Steps 1-4). With the current
>>> API, it's not clear where to do these computations. There is a per worker
>>> function preSuperstep() that can be implemented, but if we decide to pick a
>>> special worker, let's say worker 1,  to pick the k vertices then we'd waste
>>> an entire superstep where only worker 1 would do work, (by picking k
>>> vertices  in preSuperstep() and put them into an aggregator), and all other
>>> workers would be idle. Trying to do this in worker 1 in postSuperstep()
>>> would not work either because, worker 1 needs to know that all the vertices
>>> have converged to understand that it's time to pick k vertices or it's time
>>> do check in step 4, which would only be available to it in the beginning of
>>> the next superstep.
>>>
>>> A master.compute() extension would run at the master and before the
>>> superstep and would modify the aggregator that would keep the k vertices
>>> before the aggregators are broadcast to the workers, which are all very
>>> short sequential computations, so they would not waste resources the way a
>>> preSuperstep() or postSuperstep() approach would do. It would also enable
>>> running new algorithms like kmeans that are composed of very vertex-centric
>>> computations glued together by small sequential ones. It would basically
>>> boost Giraph with sequential computation in a non-wasteful way.
>>>
>>> I am a phd student at Stanford and I have been working on my own
>>> BSP/Pregel implementation since last year. It's called GPS. I haven't
>>> distributed it, mainly because in September I learned about Giraph and I
>>> decided to slow down on working on it :). We have basically been using GPS
>>> as our own research platform. The source code for GPS is here if any one is
>>> interested (https://subversion.assembla.**com/svn/phd-projects/gps/**
>>> trunk/ <https://subversion.assembla.com/svn/phd-projects/gps/trunk/>).
>>> We have the master.compute() feature in GPS, and here's an example of
>>> KMeans implementation in GPS with master.compute(): (
>>> https://subversion.assembla.**com/svn/phd-projects/gps/**
>>> trunk/src/java/gps/examples/**kmeans/<https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/>).
>>> (Aggregators are called GlobalObjects in GPS). There is another example (
>>> https://subversion.assembla.**com/svn/phd-projects/gps/**
>>> trunk/src/java/gps/examples/**randomgraphcoarsening/<https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/>),
>>> which I'll skip explaining because it's very detailed and would make the
>>> similar points that I am trying to make with k-means. Master.compute() in
>>> general would make it possible to glue together any graph algorithm that is
>>> composed of multiple stages with different message types and computations
>>> that is conducive to run with vertex.compute(). There are many examples of
>>> such algorithms: recursive partitioning, triangle counting, even much
>>> simpler things like finding shortests paths for 100 vertices in pieces
>>> (first to 5 vertices, then to another 5, then to another 5, etc..), which
>>> would be good because trying to find shortests paths to 100 vertices
>>> require a very large messages (would need to store 100 integers per
>>> message)).
>>>
>>> If the Giraph team approves, I would like to take a similar approach in
>>> implementing this feature in Giraph as I've done in GPS. Overall:
>>> Add a Master.java to org.apache.giraph.graph, that is default Master,
>>> with a compute function that by default aggregates all aggregators and does
>>> the check of whether or not the computation has ended (by comparining
>>> numVertices with numFinishedVertices). This would be a refactoring of
>>> org.apache.giraph.graph.**BspServiceMaster class (as far as I can see).
>>> Extend GiraphJob to have a setMaster() method to set a master class (by
>>> default it would be the default master above)
>>> The rest would be sending the custom master class to probably all
>>> workers but only the master would instantiate it with reflection. I need to
>>> learn more on how to do these, I am not familiar with that part of the
>>> Giraph code base yet.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>>> administrators: https://issues.apache.org/**jira/secure/**
>>> ContactAdministrators!default.**jspa<https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa>
>>> For more information on JIRA, see: http://www.atlassian.com/**
>>> software/jira <http://www.atlassian.com/software/jira>
>>>
>>>
>>>
>>
>>
>
>

Re: [jira] [Created] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by Avery Ching <ac...@apache.org>.
Not sure if Semih is on the giraph-dev list.  Forwarding the question to 
him.

Avery

P.S.  Interesting idea if I understand correctly, attaching the compute 
functionality to an aggregator that the master will run between supersteps?

On 1/19/12 1:20 PM, Claudio Martella wrote:
> Hi Semih,
>
> interesting email. I'm probably not getting your technique right, but
> why wouldn't it be possible to compute the master.compute() inside of
> an aggregator?
>
> Not only it *should* be possible, but as aggregators are computed both
> on workers AND on the master, you should have a faster computation.
> for instance you could aggregate the number of cut edges on each
> worker and aggregate the total number on the master. Same could happen
> for choosing the centroids.
>
> On Thu, Jan 19, 2012 at 9:52 PM, Semih Salihoglu (Created) (JIRA)
> <ji...@apache.org>  wrote:
>> Extending the API with a master.compute() function.
>> ---------------------------------------------------
>>
>>                  Key: GIRAPH-127
>>                  URL: https://issues.apache.org/jira/browse/GIRAPH-127
>>              Project: Giraph
>>           Issue Type: New Feature
>>           Components: bsp, examples, graph
>>             Reporter: Semih Salihoglu
>>
>>
>> First of all, sorry for the long explanation to this feature.
>>
>> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>>   * Input G(V, E), k, numEdgesThreshold, maxIterations
>>   * Algorithm:
>>   * int numEdgesCrossingClusters = Integer.MAX_INT;
>> *  int iterationNo = 0;
>>   * while ((numEdgesCrossingCluster>  numEdgesThreshold)&&  iterationNo<  maxIterations) {
>>   *    iterationNo++;
>>   *    int[] clusterCenters = pickKClusterCenters(k, G);
>>   *    findClusterCenters(G, clusterCenters);
>>   *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>>   * }
>> The algorithm goes through the following steps in iterations:
>> 1) Pick k random initial cluster centers
>> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
>> 3) Count the nuimber of edges crossing clusters
>> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
>>
>> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
>> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
>> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
>>
>> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
>>
>> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
>>
>> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
>>
>> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
>> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
>> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
>> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>
>



Re: [jira] [Created] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by Claudio Martella <cl...@gmail.com>.
Hi Semih,

interesting email. I'm probably not getting your technique right, but
why wouldn't it be possible to compute the master.compute() inside of
an aggregator?

Not only it *should* be possible, but as aggregators are computed both
on workers AND on the master, you should have a faster computation.
for instance you could aggregate the number of cut edges on each
worker and aggregate the total number on the master. Same could happen
for choosing the centroids.

On Thu, Jan 19, 2012 at 9:52 PM, Semih Salihoglu (Created) (JIRA)
<ji...@apache.org> wrote:
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>
>
> First of all, sorry for the long explanation to this feature.
>
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
>
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
>
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
>
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
>
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
>
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>



-- 
   Claudio Martella
   claudio.martella@gmail.com

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment: GIRAPH-127.patch
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403405#comment-13403405 ] 

Jakob Homan commented on GIRAPH-127:
------------------------------------

+1
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294258#comment-13294258 ] 

Avery Ching commented on GIRAPH-127:
------------------------------------

Hey, looks pretty clean so far Jan.

>The API of MasterCompute.java. I changed voteToHalt() to haltComputation(), since the master doesn't really vote, but can end the computation by itself. The rest of the API is similar to that of a BasicVertex.

I like this.

>How will we combine MasterCompute / WorkerContext? WorkerContext should only run on the workers, following from its name, but since preApplication() is the place where aggregators are generally registered, the Master will have to do this again seperately in this case, which is not very elegant. The master does not have to run useAggregator, since it uses all aggregators by default.

I'm a little unclear on the meaning here.  I'm assuming that the aggregators from MasterCompute would be available prior to preApplication.  Am I missing something?

>Given the issue above, MasterCompute can not use aggregators at superstep 0 at this point, since they are not registered. Starting with superstep 1, they have been collected from the workers and the master can manipulate them. getSuperstep(), getNumVertices(), and getNumEdges() work correctly, as far as I know.

This seems reasonable.

                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Assigned] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Avery Ching (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Avery Ching reassigned GIRAPH-127:
----------------------------------

    Assignee: Semih Salihoglu

Looking forward to this.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403414#comment-13403414 ] 

Hudson commented on GIRAPH-127:
-------------------------------

Integrated in Giraph-trunk-Commit #127 (See [https://builds.apache.org/job/Giraph-trunk-Commit/127/])
    GIRAPH-127: Extending the API with a master.compute() function.  Contributed by Jan van der Lugt. (Revision 1355128)

     Result = SUCCESS
jghoman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1355128
Files : 
* /giraph/trunk/CHANGELOG
* /giraph/trunk/src/main/java/org/apache/giraph/benchmark/RandomMessageBenchmark.java
* /giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedServiceMaster.java
* /giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleMasterComputeVertex.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/DefaultMasterCompute.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GlobalStats.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/MasterCompute.java
* /giraph/trunk/src/main/java/org/apache/giraph/graph/MasterThread.java
* /giraph/trunk/src/test/java/org/apache/giraph/TestBspBasic.java

                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment: GIRAPH-127.patch
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400849#comment-13400849 ] 

Jakob Homan commented on GIRAPH-127:
------------------------------------

bq. The approach I've chosen now is to require that the master.compute() registers the aggregators seperately from the workers (who use the workercontext).
+1

Overall, this is looking good.  Some commented out code and no tests yet, but the approach is coming along well.  Rather than a default, no-op implementation, can we just not run the mastercompute function when it's not defined? Some more javadoc in MasterCompute to better explain its intention for those who won't be reading this JIRA would be good.   Jan, want to come up with a more complete patch?  There's definitely consensus for adding this into Giraph.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295328#comment-13295328 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

I'll clarify my second point:

As it is now, only workers execute the functions in the workercontext. The master does not, but it runs before it starts the first superstep on the workers. Given these two facts, the first execution of the mastercompute runs before preApplication. Since aggregators are usually registered in preApplication, the master cannot use the aggregators to send a value to the workers. A solution could be to run preApplication and postApplication on the master as well, or to require the master to register the aggregators it uses inside the mastercompute. I don't know which one of these solutions I find more elegant, they both have their drawbacks.

Hope it's clear this way, otherwise, let me know!
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399494#comment-13399494 ] 

Jakob Homan commented on GIRAPH-127:
------------------------------------

I'll take a look first thing Monday.  Also, please do file a separate JIRA.  Thanks.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127-v1.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment: GIRAPH-127.patch
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127-v1.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292040#comment-13292040 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

Attached is a first version of the master.compute. It is not done: there are no tests, the master state should be persisted on checkpoints, etc.
There are a few things I would like feedback on at this point:
- The API of MasterCompute.java. I changed voteToHalt() to haltComputation(), since the master doesn't really vote, but can end the computation by itself. The rest of the API is similar to that of a BasicVertex.
- How will we combine MasterCompute / WorkerContext? WorkerContext should only run on the workers, following from its name, but since preApplication() is the place where aggregators are generally registered, the Master will have to do this again seperately in this case, which is not very elegant. The master does not have to run useAggregator, since it uses all aggregators by default.
- Given the issue above, MasterCompute can not use aggregators at superstep 0 at this point, since they are not registered. Starting with superstep 1, they have been collected from the workers and the master can manipulate them. getSuperstep(), getNumVertices(), and getNumEdges() work correctly, as far as I know.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Avery Ching (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189574#comment-13189574 ] 

Avery Ching commented on GIRAPH-127:
------------------------------------

I think this functionality is very useful and would actually replace a lot of the WorkerContext functionality. Sequential steps do need to be done between computations sometimes and "Pick k random initial cluster centers" is a good example.

While WorkerContext allows us to do simple things, it is not as efficient for certain calculations (i.e. suppose all workers needed a global value from HDFS, it is cheaper to do once and broadcast the outcome rather than all workers hitting HDFS). Still, WorkerContext can be useful (say for dumping worker stats), so I wouldn't remove it, rather just give our users a broader choice on computation around supersteps. 

I see that the Master#compute() should have access to all aggregators to do its work.  Overall, I like the idea and would definitely like to see how we can add this in. Getting the interface right will be a little hard I think, but we can iterate over it.  

Basically, from what Semih has said is that we gain 
1)  A clean way to do sequential computation between supersteps
2)  Removing the extra superstep if we simulate this idea with a 'picked worker'
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment: GIRAPH-127-v1.patch
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Eli Reisman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403702#comment-13403702 ] 

Eli Reisman commented on GIRAPH-127:
------------------------------------

The build is failing tonight after all the newly committed patches. I rolled back my repo to the GIRAPH-219 commit and patched in this commit and the GIRAPH-213 and the tests all pass again. So I think this is not the culprit. I'll try a few more tonight if I have time and see which one breaks.

                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399490#comment-13399490 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

Ping!

I finished and tested the master.compute() function, it seems to work pretty well. The approach I've chosen now is to require that the master.compute() registers the aggregators seperately from the workers (who use the workercontext). I've also tried executing the workercontext.preapplication on the master, which works as well but seems like a bad plan, as it can have side-effect that the user might not expect. I've added a new patch, please let me know what you think!

I've also fixed this issue I found on the mailing list (and ran into myself): http://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201112.mbox/%3C4EDED788.4020904@apache.org%3E . The patch is only a few lines, I could file a separate JIRA for it.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402734#comment-13402734 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

The latest version comes with some documentation in the MasterCompute class, an example, a unit test which makes use of the example, and some fixes regarding how aggregators are aggregated on the master. Please let me know if it needs more work or whether it is ready to be merged. It passes mvn verify and has no checkstyle errors.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Eugene Koontz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated GIRAPH-127:
---------------------------------

    Fix Version/s: 0.2.0
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127-v1.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290234#comment-13290234 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

I have taken this work over from Semih, since he is very busy at the moment and I need it for my master thesis. I hope to finish this in a matter of weeks, at the moment I'm getting to know the internals of Giraph a little better.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment: GIRAPH-127.patch
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401517#comment-13401517 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

Thanks! The commented out code was to try and use the WorkerContext on the Master, which seems to work. I prefer the no-op implementation (similar to the no-op WorkerContext), because it makes code simpler and easier to read. Also, it's overhead is in the order of 1 function call for every superstep, otherwise there would be a constant number of conditionals for every superstep. It'll probably be optimized away anyway. I'll add more javadoc to the MasterCompute and make sure the master is logically running before the workers, because at the moment it's running after the workers. Don't exactly know how I would write unit tests for this, if someone could help me with those, that would be great.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan van der Lugt updated GIRAPH-127:
------------------------------------

    Attachment:     (was: GIRAPH-127-v1.patch)
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290126#comment-13290126 ] 

Sebastian Schelter commented on GIRAPH-127:
-------------------------------------------

Any update on the work on this? I'd also need the functionality :)
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Semih Salihoglu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267657#comment-13267657 ] 

Semih Salihoglu commented on GIRAPH-127:
----------------------------------------

Avery I promise I'll get to this. It's been very crazy. I have quals coming
up. I'll be starting an internship at MSR late June. That's when I'll look
at this. I know it's a big embarrassment that I make this wait so long :)



                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jan van der Lugt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402751#comment-13402751 ] 

Jan van der Lugt commented on GIRAPH-127:
-----------------------------------------

Latest patch has a really small addition to the example and unit test, in order to ensure testing that the aggregator values work in superstep 0 as well.
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Assigned] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Jakob Homan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakob Homan reassigned GIRAPH-127:
----------------------------------

    Assignee: Jan van der Lugt  (was: Semih Salihoglu)
    
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Jan van der Lugt
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-127.patch, GIRAPH-127.patch
>
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (GIRAPH-127) Extending the API with a master.compute() function.

Posted by "Benjamin Heitmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GIRAPH-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267716#comment-13267716 ] 

Benjamin Heitmann commented on GIRAPH-127:
------------------------------------------

I also have an algorithm which can benefit a lot from such a master function. 
Probably most/any global termination condition of an algorithm can be checked most efficiently by a master function. 
                
> Extending the API with a master.compute() function.
> ---------------------------------------------------
>
>                 Key: GIRAPH-127
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-127
>             Project: Giraph
>          Issue Type: New Feature
>          Components: bsp, examples, graph
>            Reporter: Semih Salihoglu
>            Assignee: Semih Salihoglu
>
> First of all, sorry for the long explanation to this feature.
> I want to expand the API of Giraph with a new function called master.compute(), that would get called at the master before each superstep and I will try to explain the purpose that it would serve with an example. Let's say we want to implement the following simplified version of the k-means clustering algorithm. Pseudocode below:
>  * Input G(V, E), k, numEdgesThreshold, maxIterations
>  * Algorithm:
>  * int numEdgesCrossingClusters = Integer.MAX_INT;
> *  int iterationNo = 0;
>  * while ((numEdgesCrossingCluster > numEdgesThreshold) && iterationNo < maxIterations) {
>  *    iterationNo++;
>  *    int[] clusterCenters = pickKClusterCenters(k, G);
>  *    findClusterCenters(G, clusterCenters);
>  *    numEdgesCrossingClusters = countNumEdgesCrossingClusters();
>  * }
> The algorithm goes through the following steps in iterations:
> 1) Pick k random initial cluster centers
> 2) Assign each vertex to the cluster center that it's closest to (in Giraph, this can be implemented in message passing similar to how ShortestPaths is implemented):
> 3) Count the nuimber of edges crossing clusters
> 4) Go back to step 1, if there are a lot of edges crossing clusters and we haven't exceeded maximum number of iterations yet.
> In an algorithm like this, step 2 and 3 are where most of the work happens and both parts have very neat message-passing implementations. I'll try to give an overview without going into the details. Let's say we define a Vertex in Giraph to hold a custom Writable object that holds 2 integer values and sends a message with upto 2 integer values.
> Step 2 is very similar to ShortestPaths algorithm and has two stages: In the first stage, each vertex checks to see whether or not it's one of the cluster centers. If so, it assigns itself the value (id, 0), otherwise it assigns itself (Null, Null). In the 2nd stage, the vertices assign themselves to the minimum distance cluster center by looking at their neighbors (cluster centers, distance) values (received as 2 integer messages) and their current values, and changing their values if they find a lower distance cluster center. This happens in x number of supersteps until every vertex converges.
> Step 3, counting the number of edges crossing clusters, is also very easy to implement in Giraph. Once each vertex has a cluster center, the number of edges crossing clusters can be counted by an aggregator, let's say called "num-edges-crossing". It would again have two stages: First stage, every vertex just sends its cluster id to all its neighbors. Second stage, every vertex looks at their neighbors' cluster ids in the messages, and for each cluster id that is not equal to its own cluster id, it increments "num-edges-crossing" by 1.
> The other 2 steps, step 1 and 4, are very simple sequential computations. Step 1 just picks k random vertex ids and puts it into an aggregator. Step 4 just compares "num-edges-crossing" by a threshold and also checks whether or not the algorithm has exceeded maxIterations (not supersteps but iterations of going through Steps 1-4). With the current API, it's not clear where to do these computations. There is a per worker function preSuperstep() that can be implemented, but if we decide to pick a special worker, let's say worker 1,  to pick the k vertices then we'd waste an entire superstep where only worker 1 would do work, (by picking k vertices  in preSuperstep() and put them into an aggregator), and all other workers would be idle. Trying to do this in worker 1 in postSuperstep() would not work either because, worker 1 needs to know that all the vertices have converged to understand that it's time to pick k vertices or it's time do check in step 4, which would only be available to it in the beginning of the next superstep.
> A master.compute() extension would run at the master and before the superstep and would modify the aggregator that would keep the k vertices before the aggregators are broadcast to the workers, which are all very short sequential computations, so they would not waste resources the way a preSuperstep() or postSuperstep() approach would do. It would also enable running new algorithms like kmeans that are composed of very vertex-centric computations glued together by small sequential ones. It would basically boost Giraph with sequential computation in a non-wasteful way.
> I am a phd student at Stanford and I have been working on my own BSP/Pregel implementation since last year. It's called GPS. I haven't distributed it, mainly because in September I learned about Giraph and I decided to slow down on working on it :). We have basically been using GPS as our own research platform. The source code for GPS is here if any one is interested (https://subversion.assembla.com/svn/phd-projects/gps/trunk/). We have the master.compute() feature in GPS, and here's an example of KMeans implementation in GPS with master.compute(): (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/kmeans/). (Aggregators are called GlobalObjects in GPS). There is another example (https://subversion.assembla.com/svn/phd-projects/gps/trunk/src/java/gps/examples/randomgraphcoarsening/), which I'll skip explaining because it's very detailed and would make the similar points that I am trying to make with k-means. Master.compute() in general would make it possible to glue together any graph algorithm that is composed of multiple stages with different message types and computations that is conducive to run with vertex.compute(). There are many examples of such algorithms: recursive partitioning, triangle counting, even much simpler things like finding shortests paths for 100 vertices in pieces (first to 5 vertices, then to another 5, then to another 5, etc..), which would be good because trying to find shortests paths to 100 vertices require a very large messages (would need to store 100 integers per message)).
> If the Giraph team approves, I would like to take a similar approach in implementing this feature in Giraph as I've done in GPS. Overall:
> Add a Master.java to org.apache.giraph.graph, that is default Master, with a compute function that by default aggregates all aggregators and does the check of whether or not the computation has ended (by comparining numVertices with numFinishedVertices). This would be a refactoring of org.apache.giraph.graph.BspServiceMaster class (as far as I can see).
> Extend GiraphJob to have a setMaster() method to set a master class (by default it would be the default master above)
> The rest would be sending the custom master class to probably all workers but only the master would instantiate it with reflection. I need to learn more on how to do these, I am not familiar with that part of the Giraph code base yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira