Posted to dev@mahout.apache.org by Neal Clark <nc...@gmail.com> on 2010/05/31 22:48:40 UTC

Distributed Graph Algorithms

Hi,

As part of my research for my master's thesis I have developed a number
of distributed MapReduce graph algorithms. If you are interested, I
have both connectivity and minimum spanning tree algorithms which I
would be willing to contribute. Using the techniques developed for
these algorithms it would be easy to implement a large variety of
other graph algorithms. Please let me know if you are interested.

Thanks,

Neal.

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@gmail.com>.
I have uploaded a very basic version of my connectivity MapReduce job here
http://pastebin.org/322216

The algorithm takes an edge list consisting of integer vertex names:

<v1>\t<v2>\n
<v2>\t<v1>\n
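
For example, a triangle on vertexes 1, 2 and 3 would be written as six
lines, one per direction of each edge:

1	2
2	1
1	3
3	1
2	3
3	2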

The current algorithm modifies the graph by adding and removing edges so
that each vertex eventually connects only to its smallest-numbered
neighbour.

In each phase the current vertex finds the smallest numbered vertex among
itself and its neighbours. It then creates an edge from each of its
neighbours to the smallest known vertex, plus a final edge from the
smallest known vertex back to itself. This ensures that if the smallest
known vertex later learns of a smaller vertex it can propagate this
information to the rest of the graph.

The algorithm is designed to extract the maximum amount of information from
the graph. The output data can easily be processed to create a list of
vertexes and the component each belongs to.
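
As a rough illustration of one iteration (simplified relative to the
pastebin version; the class and variable names here are only illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// One iteration of the min-label step (illustrative names, not the pastebin code).
// Input:  vertex -> list of its current neighbours
// Output: the rewritten edge set for the next iteration
public class MinLabelReducer
    extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

  private final IntWritable min = new IntWritable();
  private final IntWritable other = new IntWritable();

  @Override
  protected void reduce(IntWritable vertex, Iterable<IntWritable> neighbours,
      Context context) throws IOException, InterruptedException {
    // Buffer the adjacency list in memory and find the smallest vertex
    // among this vertex and its neighbours.
    List<Integer> adj = new ArrayList<Integer>();
    int smallest = vertex.get();
    for (IntWritable n : neighbours) {
      adj.add(n.get());
      if (n.get() < smallest) {
        smallest = n.get();
      }
    }
    min.set(smallest);

    // Connect every neighbour to the smallest known vertex (both directions,
    // matching the <v1>\t<v2> / <v2>\t<v1> input convention).
    for (int n : adj) {
      if (n != smallest) {
        other.set(n);
        context.write(other, min);
        context.write(min, other);
      }
    }

    // Finally connect the smallest known vertex back to this one so that a
    // still-smaller label discovered later can propagate through it.
    if (smallest != vertex.get()) {
      context.write(min, vertex);
      context.write(vertex, min);
    }
  }
}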

Arbitrary graphs can easily be converted to the necessary integer-pair edge
list format. If the graph consists of text-labelled vertexes, a random
3*log2(n)-bit integer can be assigned to each vertex in parallel, where n is
the number of vertexes in the initial graph. Using 3*log2(n) bits ensures
that it is extremely unlikely that two vertexes will receive the same
integer label.
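
One way to do that assignment in a single parallel pass, without a separate
labelling job and a join, is to hash the text label: a decent hash behaves
like a random assignment but yields the same integer everywhere the label
appears. A sketch (names are mine; the low 64 bits of an MD5 digest cover
3*log2(n) bits for graphs up to a couple of million vertexes):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: derive a pseudo-random 64-bit id from a text vertex label.
public final class VertexIds {

  private VertexIds() {
  }

  public static long idFor(String label) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(label.getBytes("UTF-8"));
      long id = 0L;
      for (int i = 0; i < 8; i++) {
        id = (id << 8) | (digest[i] & 0xffL);  // keep the low 64 bits
      }
      return id;
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);      // MD5 is always present
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e);      // UTF-8 is always present
    }
  }
}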

I am currently looking for a good way to generate large graphs or get access
to large graph datasets. Since the number of vertexes usually far exceeds
the number of reduce nodes, overloading a single vertex shouldn't be too
much of a problem. More testing is needed to confirm this. If overloading
does prove to be a problem we can use a two-phase approach to determine the
min vertexes.

Open questions:
1) What input formats should be supported?
2) Do you have any suggestions on what intermediary format could be used
between phases?
3) How best to approach integrating these algorithms into Mahout?
4) Does anyone know where I can find some large test graphs?
5) Do you think that this type of algorithm is a good fit for Mahout?

Thanks,

Neal.

On Mon, May 31, 2010 at 2:54 PM, Ted Dunning <te...@gmail.com> wrote:

> To help you get started, we have a collection of sparse matrix structures,
> some of which are amenable to row-wise distribution to mappers in
> map-reduce
> programs.   If your connectivity program is basically just the transitive
> closure of the graph, then that would probably suffice (although I would
> worry about the output getting large).  The MST algorithm will probably
> stress things a bit more.
>
> On Mon, May 31, 2010 at 2:47 PM, Neal Clark <nc...@uvic.ca> wrote:
>
> > I will have to take a closer look at the Mahout data structures before
> > I can be certain how hard it would be.
> >
>

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@uvic.ca>.
I just wanted to provide an update. I have rewritten the connectivity
algorithm to use a sorting/streaming model. This approach eliminates the
memory bottlenecks I was running into with my previous approach.

I just finished running the new algorithm on the twitter dataset.

Nodes: 160
MapJobs: 640

Iterations: 6
Total time: 63m 32s
Average time per iteration: 10m 20s

It appears that the twitter graph is connected; however, I will need to test
my algorithm more thoroughly before I make that claim. ;)

I have also uploaded copies of the hadoop job-tracker pages.

http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0017%20on%20db101a.html
http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0012%20on%20db101a.html
http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0013%20on%20db101a.html
http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0014%20on%20db101a.html
http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0015%20on%20db101a.html
http://www.treadbook.com/uploads/mahout/run1/Hadoop%20job_201006150013_0016%20on%20db101a.html

There is still a lot of testing and performance tuning to be done but I will
try to keep you updated on any progress.

To prevent map input files from being split, is it sufficient to extend
FileInputFormat and override the isSplitable() method? In order for the
algorithm to run correctly each mapper must receive a consecutive block from
the previous reduce phase which contains all edges for a particular vertex.
That is, I need to ensure that a vertex's edges are not split across
multiple mappers.
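
Concretely, something along these lines is what I have in mind (new-API
sketch, untested):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treat each input file as a single split so that all edges of a given
// vertex (written out consecutively by the previous reduce) stay in one
// mapper.
public class UnsplittableTextInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}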

Thanks,

Neal.

On Fri, Jun 11, 2010 at 9:23 AM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> On 6/10/10 6:22 PM, Ted Dunning wrote:
>
>> I think I understand the problem, but I haven't been reading super
>> carefully
>> as you describe your algorithm.
>>
>> The basic idea, though, is pretty simple.  You are producing higher and
>> higher powers of the adjacency matrix while labeling each connected
>> component with the lowest component.
>>
>> The algorithm as you described it sounds like you are keeping around the
>> entire matrix with appropriate labeling.  An alternative is to reduce the
>> matrix each time that you discover that a connected component has merged
>> with another.  That would mean that the graph you are processing would
>> decrease in size on each iteration terminating when it has a single node
>> for
>> each connected sub-graph.  The problem with that approach is remembering
>> the
>> labeling of each node in the original graph.  One way to do that fairly
>> efficiently without an explosion in the amount of data being carried
>> around
>> would be to create a set of side files at each step that contain the mapping
>> from nodes in one generation to their label in the next generation.  If
>> you
>> have k steps of reduction (k \le log_2 N where N is the size of the graph,
>> I
>> think), then you would have k side mapping files.  After you converge in
>> the
>> merging process, you can run k joins to reverse the process and propagate
>> the labels from the last generation back to the first.  The first set of k
>> map-reduces should run progressively faster as the graph collapses and the
>> second set of k map-reduces in which you do the joins should run
>> progressively faster as the number of labels being processed increases.
>>
>>
> I've been reading this thread with interest as it may offer some
> scalability improvements to the current MeanShiftCanopy clustering
> implementation. That code currently runs a sequence of
> Canopy-clustering-like iterations to build, bottom-up, a graph of clusters
> which contain the identities of the points they have subsumed as their
> centers shift towards their local centers of maximum density and they merge
> together. These clusters can become very large objects as the list of
> subsumed ids grows and this becomes an ultimate scalability limitation. I've
> been stewing about an approach similar to what is above to write out the
> mergers in each iteration to a side-mapping file. It would run pretty much
> like the last two sentences above.
>

Re: Distributed Graph Algorithms

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 6/10/10 6:22 PM, Ted Dunning wrote:
> I think I understand the problem, but I haven't been reading super carefully
> as you describe your algorithm.
>
> The basic idea, though, is pretty simple.  You are producing higher and
> higher powers of the adjacency matrix while labeling each connected
> component with the lowest component.
>
> The algorithm as you described it sounds like you are keeping around the
> entire matrix with appropriate labeling.  An alternative is to reduce the
> matrix each time that you discover that a connected component has merged
> with another.  That would mean that the graph you are processing would
> decrease in size on each iteration terminating when it has a single node for
> each connected sub-graph.  The problem with that approach is remembering the
> labeling of each node in the original graph.  One way to do that fairly
> efficiently without an explosion in the amount of data being carried around
> would be to create a set of side files at each step that contain the mapping
> from nodes in one generation to their label in the next generation.  If you
> have k steps of reduction (k \le log_2 N where N is the size of the graph, I
> think), then you would have k side mapping files.  After you converge in the
> merging process, you can run k joins to reverse the process and propagate
> the labels from the last generation back to the first.  The first set of k
> map-reduces should run progressively faster as the graph collapses and the
> second set of k map-reduces in which you do the joins should run
> progressively faster as the number of labels being processed increases.
>    
I've been reading this thread with interest as it may offer some 
scalability improvements to the current MeanShiftCanopy clustering 
implementation. That code currently runs a sequence of 
Canopy-clustering-like iterations to build, bottom-up, a graph of 
clusters which contain the identities of the points they have subsumed 
as their centers shift towards their local centers of maximum density 
and they merge together. These clusters can become very large objects as 
the list of subsumed ids grows and this becomes an ultimate scalability 
limitation. I've been stewing about an approach similar to what is above 
to write out the mergers in each iteration to a side-mapping file. It 
would run pretty much like the last two sentences above.

Re: Distributed Graph Algorithms

Posted by Ted Dunning <te...@gmail.com>.
I think I understand the problem, but I haven't been reading super carefully
as you describe your algorithm.

The basic idea, though, is pretty simple.  You are producing higher and
higher powers of the adjacency matrix while labeling each connected
component with the lowest component.

The algorithm as you described it sounds like you are keeping around the
entire matrix with appropriate labeling.  An alternative is to reduce the
matrix each time that you discover that a connected component has merged
with another.  That would mean that the graph you are processing would
decrease in size on each iteration terminating when it has a single node for
each connected sub-graph.  The problem with that approach is remembering the
labeling of each node in the original graph.  One way to do that fairly
efficiently without an explosion in the amount of data being carried around
would be to create a set of side files at each step that contain the mapping
from nodes in one generation to their label in the next generation.  If you
have k steps of reduction (k \le log_2 N where N is the size of the graph, I
think), then you would have k side mapping files.  After you converge in the
merging process, you can run k joins to reverse the process and propagate
the labels from the last generation back to the first.  The first set of k
map-reduces should run progressively faster as the graph collapses and the
second set of k map-reduces in which you do the joins should run
progressively faster as the number of labels being processed increases.

Doesn't this approach avoid the problems you mention with lots of labels
being passed around all the time?

The optimization that I would apply to this algorithm (someday as you
mention) is to detect when the data becomes small enough to process
in-memory on a single machine.  This would avoid the overhead of invoking a
map-reduce program.  My guess is that you should be able to process a graph
with up to 10^7 or 10^8 nodes in memory faster than you can with a
multi-node map-reduce program.
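
For that in-memory step, plain union-find would be more than fast enough; a
rough sketch, just to show how little code is involved:

import java.util.Arrays;

// Rough sketch of the in-memory step: weighted union-find with path
// halving over an edge list that has shrunk enough to fit on one machine.
public class UnionFind {

  private final int[] parent;
  private final int[] size;

  public UnionFind(int n) {
    parent = new int[n];
    size = new int[n];
    for (int i = 0; i < n; i++) {
      parent[i] = i;
    }
    Arrays.fill(size, 1);
  }

  // Follow parent pointers to the component representative.
  public int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];  // path halving
      x = parent[x];
    }
    return x;
  }

  // Merge the components containing a and b, smaller under larger.
  public void union(int a, int b) {
    int ra = find(a);
    int rb = find(b);
    if (ra == rb) {
      return;
    }
    if (size[ra] < size[rb]) {
      int t = ra; ra = rb; rb = t;
    }
    parent[rb] = ra;
    size[ra] += size[rb];
  }
}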


My suggestion

On Thu, Jun 10, 2010 at 6:09 PM, Neal Clark <nc...@uvic.ca> wrote:

> The problem isn't too many duplicate edges but the fact that as the algorithm
> progresses the number of edges adjacent to the minimum vertexes increases. In a
> worst case scenario the graph is completely connected and only has a single
> component. So far I have not figured out an efficient method of partitioning
> the edges of a single vertex while still passing on the minimum vertex to
> each partition.
>

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@uvic.ca>.
The problem isn't too many duplicate edges but the fact that as the algorithm
progresses the number of edges adjacent to the minimum vertexes increases. In a
worst case scenario the graph is completely connected and only has a single
component. So far I have not figured out an efficient method of partitioning
the edges of a single vertex while still passing on the minimum vertex to
each partition.

The twitter dataset has approximately 44 million vertexes. Even with a graph
this large the time necessary to process 44 million edges isn't too long.
One advantage to the sorting approach is that duplicate edges become trivial
to detect.

Once we are able to process the twitter dataset we can work on further
optimizations.

Thanks,

Neal.

On Thu, Jun 10, 2010 at 5:31 PM, Ted Dunning <te...@gmail.com> wrote:

> Would a combiner help?
>
> On Thu, Jun 10, 2010 at 5:27 PM, Neal Clark <nc...@uvic.ca> wrote:
>
> > Using this approach it should be possible to do the bulk of the
> > algorithm in the Mapper and then use the SecondarySort Reducer to sort the
> > output.
> >
>

Re: Distributed Graph Algorithms

Posted by Ted Dunning <te...@gmail.com>.
Would a combiner help?

On Thu, Jun 10, 2010 at 5:27 PM, Neal Clark <nc...@uvic.ca> wrote:

> Using this approach it should be possible to do the bulk of the
> algorithm in the Mapper and then use the SecondarySort Reducer to sort the
> output.
>

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@uvic.ca>.
On Wed, Jun 9, 2010 at 5:12 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Wed, Jun 9, 2010 at 5:04 PM, Ted Dunning <te...@gmail.com> wrote:
> >
> >  > Open questions:
> > > 1) What input formats should be supported?
> >
>
> Your text input format is good, and fairly standard, actually.  Another
> would be
> something like SequenceFile<IntWritable,IntWritable>, which is basically
> what your current output format looks like!


With large datasets I think using a binary intermediary format would provide
much better results. I will experiment with the SequenceFile and see how it
performs.
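
If it works out, the per-iteration wiring would be as simple as something
like this (sketch only; the class name is just a placeholder and the mapper,
reducer and paths are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IterationJob {

  // Wire one connectivity iteration to read and write
  // SequenceFile<IntWritable, IntWritable> between phases.
  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "connectivity-iteration");
    job.setJarByClass(IterationJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    return job;
  }
}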

>
> > > 2) Do you have any suggestions on what intermediary format could be
> used
> > > between phases?
> > >
> >
> > These should be sequence files of some kind.  Using the Mahout vector
> > format
> > would probably work well at the cost of a bit of overhead due to using
> > doubles to store integers.
> >
>
> Yeah, we really should extend at some point to allowing ints and booleans
> too.
> But then again, double is only double the size of an int.
> It's not like it's a *huge* factor.
>
>
>
> > > 3) How best to approach integrating these algorithms into Mahout?
> > >
> >
> > you are breaking new ground here with graph algorithms in mahout.
> >
>
> I agree - do what you feel comfortable with, we don't have anything
> currently on this.
>
>
> > >  4) Does anyone know where I can find some large test graphs?
> > >
> >
> > Consider the wikipedia link graph.  Also interesting might be the
> > cooccurrence graph of words in a large corpus.  The twitter social graph
> > might be interesting as well.
> >
>
> The twitter social graph is pretty humongous - you can get the torrent
> here: http://an.kaist.ac.kr/traces/WWW2010.html
> And I've got it hiding on Amazon S3 too, ask me offline if you want
> access to that one.
>

Thanks for the link to the twitter dataset. I have been running the
connectivity algorithm on a 40-node blade cluster with 4-core 2.66 GHz Xeon
processors in each blade.  It has become pretty clear that I need to
slightly rethink my approach. This doesn't surprise me in the least.

Instead of finding the min vertex in memory I am going to try to use the
SecondarySort implementation from the Hadoop examples to sort the edges by
first and second vertex, grouping them by v1. This would allow a Mapper to
read in the first pair of each partition and use it as the minimum vertex,
thereby eliminating the need to buffer any edges. Duplicates can be removed
by comparing each edge with the previous one.

Using this approach it should be possible to do the bulk of the
algorithm in the Mapper and then use the SecondarySort Reducer to sort the
output.
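
For reference, the plumbing I have in mind looks roughly like this (modelled
on the SecondarySort example shipped with Hadoop; the class names are mine
and this is untested):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConnectivitySecondarySort {

  // Composite key (v1, v2), sorted by v1 then v2, so that within a vertex's
  // group the first edge seen names its minimum neighbour.
  public static class Edge implements WritableComparable<Edge> {
    private int v1;
    private int v2;

    public void set(int v1, int v2) { this.v1 = v1; this.v2 = v2; }
    public int getV1() { return v1; }
    public int getV2() { return v2; }

    public void write(DataOutput out) throws IOException {
      out.writeInt(v1);
      out.writeInt(v2);
    }

    public void readFields(DataInput in) throws IOException {
      v1 = in.readInt();
      v2 = in.readInt();
    }

    public int compareTo(Edge o) {
      if (v1 != o.v1) return v1 < o.v1 ? -1 : 1;
      if (v2 != o.v2) return v2 < o.v2 ? -1 : 1;
      return 0;
    }

    @Override public int hashCode() { return v1 * 157 + v2; }
    @Override public boolean equals(Object o) {
      return o instanceof Edge && compareTo((Edge) o) == 0;
    }
  }

  // Send every edge of a vertex to the same reducer.
  public static class VertexPartitioner extends Partitioner<Edge, NullWritable> {
    @Override
    public int getPartition(Edge key, NullWritable value, int numPartitions) {
      return (key.getV1() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on v1 only, so one reduce() call streams all of a vertex's edges
  // in sorted order and never needs to buffer them.
  public static class VertexGroupingComparator extends WritableComparator {
    public VertexGroupingComparator() {
      super(Edge.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      int x = ((Edge) a).getV1();
      int y = ((Edge) b).getV1();
      return x == y ? 0 : (x < y ? -1 : 1);
    }
  }
}

The driver would then use the composite key as the map output key and point
setPartitionerClass() and setGroupingComparatorClass() at the two helper
classes.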

Re: Distributed Graph Algorithms

Posted by Jake Mannix <ja...@gmail.com>.
On Wed, Jun 9, 2010 at 5:04 PM, Ted Dunning <te...@gmail.com> wrote:
>
>  > Open questions:
> > 1) What input formats should be supported?
>

Your text input format is good, and fairly standard, actually.  Another
would be
something like SequenceFile<IntWritable,IntWritable>, which is basically
what your current output format looks like!



> > 2) Do you have any suggestions on what intermediary format could be used
> > between phases?
> >
>
> These should be sequence files of some kind.  Using the Mahout vector
> format
> would probably work well at the cost of a bit of overhead due to using
> doubles to store integers.
>

Yeah, we really should extend at some point to allowing ints and booleans
too.
But then again, double is only double the size of an int.
It's not like it's a *huge* factor.



> > 3) How best to approach integrating these algorithms into Mahout?
> >
>
> you are breaking new ground here with graph algorithms in mahout.
>

I agree - do what you feel comfortable with, we don't have anything
currently on this.


> >  4) Does anyone know where I can find some large test graphs?
> >
>
> Consider the wikipedia link graph.  Also interesting might be the
> cooccurrence graph of words in a large corpus.  The twitter social graph
> might be interesting as well.
>

The twitter social graph is pretty humongous - you can get the torrent
here: http://an.kaist.ac.kr/traces/WWW2010.html
And I've got it hiding on Amazon S3 too, ask me offline if you want
access to that one.

  -jake

Re: Distributed Graph Algorithms

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jun 9, 2010 at 4:57 PM, Neal Clark <nc...@uvic.ca> wrote:

> I am currently looking for a good way to generate large graphs or get
> access
> to large graph datasets. Since the number of vertexes usually far exceeds
> the number of reduce nodes, overloading a single vertex shouldn't be too
> much of a problem. More testing is needed to confirm this. If overloading
> does prove to be a problem we can use a two-phase approach to determine the
> min vertexes.
>
> Open questions:
> 1) What input formats should be supported?
>

We don't have a good answer for that yet.  Recent discussions have talked a
lot about how to vectorize more ordinary text.


> 2) Do you have any suggestions on what intermediary format could be used
> between phases?
>

These should be sequence files of some kind.  Using the Mahout vector format
would probably work well at the cost of a bit of overhead due to using
doubles to store integers.


> 3) How best to approach integrating these algorithms into Mahout?
>

you are breaking new ground here with graph algorithms in mahout.


>  4) Does anyone know where I can find some large test graphs?
>

Consider the wikipedia link graph.  Also interesting might be the
cooccurrence graph of words in a large corpus.  The twitter social graph
might be interesting as well.


> 5) Do you think that this type of algorithm is a good fit for Mahout?
>

I do even though we haven't had much pull for graph algorithms yet.   This
could easily change.

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@uvic.ca>.
I have uploaded a very basic version of my connectivity MapReduce job here
http://pastebin.org/322216

The algorithm takes an edge list consisting of integer vertex names:

<v1>\t<v2>\n
<v2>\t<v1>\n

The current algorithm modifies the graph by adding and removing edges so
that each vertex eventually connects only to its smallest-numbered
neighbour.

In each phase the current vertex finds the smallest numbered vertex among
itself and its neighbours. It then creates an edge from each of its
neighbours to the smallest known vertex, plus a final edge from the
smallest known vertex back to itself. This ensures that if the smallest
known vertex later learns of a smaller vertex it can propagate this
information to the rest of the graph.

The algorithm is designed to extract the maximum amount of information from
the graph. The output data can easily be processed to create a list of
vertexes and the component each belongs to.

Arbitrary graphs can easily be converted to the necessary integer-pair edge
list format. If the graph consists of text-labelled vertexes, a random
3*log2(n)-bit integer can be assigned to each vertex in parallel, where n is
the number of vertexes in the initial graph. Using 3*log2(n) bits ensures
that it is extremely unlikely that two vertexes will receive the same
integer label.

I am currently looking for a good way to generate large graphs or get access
to large graph datasets. Since the number of vertexes usually far exceeds
the number of reduce nodes, overloading a single vertex shouldn't be too
much of a problem. More testing is needed to confirm this. If overloading
does prove to be a problem we can use a two-phase approach to determine the
min vertexes.

Open questions:
1) What input formats should be supported?
2) Do you have any suggestions on what intermediary format could be used
between phases?
3) How best to approach integrating these algorithms into Mahout?
4) Does anyone know where I can find some large test graphs?
5) Do you think that this type of algorithm is a good fit for Mahout?

Thanks,

Neal.




On Mon, May 31, 2010 at 2:54 PM, Ted Dunning <te...@gmail.com> wrote:

> To help you get started, we have a collection of sparse matrix structures,
> some of which are amenable to row-wise distribution to mappers in
> map-reduce
> programs.   If your connectivity program is basically just the transitive
> closure of the graph, then that would probably suffice (although I would
> worry about the output getting large).  The MST algorithm will probably
> stress things a bit more.
>
> On Mon, May 31, 2010 at 2:47 PM, Neal Clark <nc...@uvic.ca> wrote:
>
> > I will have to take a closer look at the Mahout data structures before
> > I can be certain how hard it would be.
> >
>

Re: Distributed Graph Algorithms

Posted by Ted Dunning <te...@gmail.com>.
To help you get started, we have a collection of sparse matrix structures,
some of which are amenable to row-wise distribution to mappers in map-reduce
programs.   If your connectivity program is basically just the transitive
closure of the graph, then that would probably suffice (although I would
worry about the output getting large).  The MST algorithm will probably
stress things a bit more.

On Mon, May 31, 2010 at 2:47 PM, Neal Clark <nc...@uvic.ca> wrote:

> I will have to take a closer look at the Mahout data structures before
> I can be certain how hard it would be.
>

Re: Distributed Graph Algorithms

Posted by Neal Clark <nc...@uvic.ca>.
Both algorithms execute in worst case O(log n) iterations and are
quite general. They would likely be able to serve as a basis for more
complex algorithms.

I will have to take a closer look at the Mahout data structures before
I can be certain how hard it would be. However, the connectivity
algorithm should be quite trivial as it is only a few lines of code.
The MST algorithm may not fit as nicely but I will do a little
research and get back to you.

Perhaps to begin with we can work on integrating the graph
connectivity algorithm. This should allow me to become more familiar
with the Mahout code base and its developers before tackling more
complex algorithms.

Thanks,

Neal.


On Mon, May 31, 2010 at 2:00 PM, Ted Dunning <te...@gmail.com> wrote:
> These might be very interesting.
>
> How efficient are these algorithms?  How general?
>
> How hard would it be to merge these into Mahout style matrix and vector data
> structures?
>
> On Mon, May 31, 2010 at 1:48 PM, Neal Clark <nc...@gmail.com> wrote:
>
>> As part of my research for my master's thesis I have developed a number
>> of distributed MapReduce graph algorithms.
>>
>

Re: Distributed Graph Algorithms

Posted by Ted Dunning <te...@gmail.com>.
These might be very interesting.

How efficient are these algorithms?  How general?

How hard would it be to merge these into Mahout style matrix and vector data
structures?

On Mon, May 31, 2010 at 1:48 PM, Neal Clark <nc...@gmail.com> wrote:

> As part of my research for my master's thesis I have developed a number
> of distributed MapReduce graph algorithms.
>

Re: Distributed Graph Algorithms

Posted by Grant Ingersoll <gs...@apache.org>.
I'm interested in graph algorithms.  To build on what Ted says, I think the best thing you could do is put up a patch that contains some code we can look at and then we can iterate.

On May 31, 2010, at 4:48 PM, Neal Clark wrote:

> Hi,
> 
> As part of my research for my master's thesis I have developed a number
> of distributed MapReduce graph algorithms. If you are interested, I
> have both connectivity and minimum spanning tree algorithms which I
> would be willing to contribute. Using the techniques developed for
> these algorithms it would be easy to implement a large variety of
> other graph algorithms. Please let me know if you are interested.
> 
> Thanks,
> 
> Neal.