Posted to common-user@hadoop.apache.org by "Peter W." <pe...@marketingbrokers.com> on 2008/02/01 22:14:03 UTC

Re: graph data representation for mapreduce

Cam,

Making a directed graph in Hadoop is not
very difficult, but traversing it live might
be, since the result of each pass is a
separate output file.

Basically, you emit a destination node as
your key in the mapper and the from-nodes as
intermediate values. Concatenate the from-values
in the reducer, assigning a weight to each edge.

Assigned edge scores come either from a
computation done in the reducer or from a
number passed along with the key.

This gives a simple but weighted from/to
depiction that can be experimented with and
improved by subsequent passes, or by REST-style
calls in the mapper to pull weights from a
MySQL database.
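
A minimal sketch of that pass (hypothetical
class names, input assumed to be tab-separated
from/to pairs, using the org.apache.hadoop.mapred
API of the day):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: for an input line "from<TAB>to", emit the destination
// node as the key and the from-node as an intermediate value.
public class EdgeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] edge = line.toString().split("\t");
    out.collect(new Text(edge[1]), new Text(edge[0]));
  }
}

// Reducer: concatenate the from-values for each destination,
// attaching a weight to each edge. The constant 1.0 here is a
// placeholder for whatever score computation or lookup is used.
class EdgeReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text toNode, Iterator<Text> fromNodes,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    StringBuilder row = new StringBuilder();
    while (fromNodes.hasNext()) {
      row.append(fromNodes.next().toString()).append(":1.0 ");
    }
    out.collect(toNode, new Text(row.toString().trim()));
  }
}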

Later,

Peter W.

Cam Bazz wrote:

> Hello,
>
> I have long been interested in storing graphs in databases, object
> databases, and Lucene-like indexes.
> ....
>
> Has anyone done any work on storing and processing graphs with map
> reduce?
> If I were to start, where would I start? I am interested in finding
> shortest paths in a large graph.


Re: graph data representation for mapreduce

Posted by edward yoon <ed...@udanax.org>.
I'm not familiar with graph algorithms, but I guess they will need
matrix, list, or multi-list structures, and the computation will also
be time-consuming.

Therefore, I think Map/Reduce can be a solution even if the
development could be complicated. I also recommend HBase and the
Matrix package.

On 2/2/08, Peter W. <pe...@marketingbrokers.com> wrote:
> Cam,
>
> Making a directed graph in Hadoop is not
> very difficult, but traversing it live might
> be, since the result of each pass is a
> separate output file.
>
> Basically, you emit a destination node as
> your key in the mapper and the from-nodes as
> intermediate values. Concatenate the from-values
> in the reducer, assigning a weight to each edge.
>
> Assigned edge scores come either from a
> computation done in the reducer or from a
> number passed along with the key.
>
> This gives a simple but weighted from/to
> depiction that can be experimented with and
> improved by subsequent passes, or by REST-style
> calls in the mapper to pull weights from a
> MySQL database.
>
> Later,
>
> Peter W.
>
> Cam Bazz wrote:
>
> > Hello,
> >
> > I have long been interested in storing graphs in databases, object
> > databases, and Lucene-like indexes.
> > ....
> >
> > Has anyone done any work on storing and processing graphs with map
> > reduce?
> > If I were to start, where would I start? I am interested in finding
> > shortest paths in a large graph.
>
>


-- 
B. Regards,
Edward yoon @ NHN, corp.

Re: graph data representation for mapreduce

Posted by Ted Dunning <td...@veoh.com>.
On reading what I sent here, I think maybe some amplification is in order.

The problem with computations that are shaped like matrix multiplication is
that matrix-matrix multiplication requires O(n^3) operations on O(n^2) data
elements.  If coded naively, you require O(n^3) data loads and stores, which
is very bad since on modern architectures arithmetic operations are much
faster than memory operations.  With Hadoop, and map-reduce in general, this
problem is much, much worse.

The answer is to hold some block of data in a fast level of storage and to
perform multiple operations on each data element before letting it go.  For
conventional main memory architectures, it is common to use one or two
levels of blocking, one so that registers get re-used and another so that L1
cache gets re-used effectively.  Doing this well can mean a 10x or even 100x
improvement in performance.  Similar, but not quite so impressive, gains are
possible with matrix-vector products.
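
To illustrate one level of blocking (a toy sketch of my own, not from any
particular library): compute C += A*B tile by tile, so each blk x blk tile
is reused many times while it is hot in cache instead of being re-fetched
for every scalar operation.

static void blockedMultiply(double[][] a, double[][] b,
                            double[][] c, int n, int blk) {
  // Walk the matrices tile by tile; finish all the work on a tile
  // before moving on, so its data stays resident in cache.
  for (int i0 = 0; i0 < n; i0 += blk)
    for (int k0 = 0; k0 < n; k0 += blk)
      for (int j0 = 0; j0 < n; j0 += blk)
        for (int i = i0; i < Math.min(i0 + blk, n); i++)
          for (int k = k0; k < Math.min(k0 + blk, n); k++) {
            double aik = a[i][k];   // held in a register across the j loop
            for (int j = j0; j < Math.min(j0 + blk, n); j++)
              c[i][j] += aik * b[k][j];
          }
}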

This issue is reflected ubiquitously in high quality numerical linear
algebra codes.  The emphasis is always on trying to avoid element-by-element
access, replacing it with what are called level 3 BLAS operations, which are
basically just in-place updates using matrix-matrix multiplication as a
primitive operation.  The importance of in-place update and level 3
decomposition is largely what drove Colt (the Java matrix library) to
support views and slices so carefully.

With Map-reduce, these problems are much more serious since we now have a
data channel that is slower yet again.  The answer is that we have to
structure code in such a way as to move large sub-matrices around as single
data elements and operate on them using high-level operations.

In my own graph-based algorithms, I use something shaped like a
matrix-transpose-times-matrix product, and I basically join columns of the
left operand to copies of the right operand.  This results in n times more
data traffic than I would like, but it allows the mappers to engage in
substantial chunks of work and allows me to get pretty decent throughput.
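
As an illustration only (not Ted's actual code, and a simpler layout than
the column join he describes): one common way to give mappers sub-matrix
sized units of work in an A'A-shaped computation is to have each mapper fold
its rows into a single dense partial product and emit that block as one
record.  This sketch assumes dense, tab-separated rows and a hypothetical
column count P small enough for a P x P block in memory; a lone reducer then
sums the handful of emitted blocks.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each mapper accumulates the outer products of the dense rows it
// reads into one P x P partial sum of A'A, then emits the whole
// block as a single record in close().
public class TransposeProductMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int P = 100;        // hypothetical column count
  private final double[][] partial = new double[P][P];
  private OutputCollector<IntWritable, Text> out;

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    out = output;                          // saved for use in close()
    String[] tok = line.toString().split("\t");
    double[] row = new double[P];
    for (int j = 0; j < P; j++) row[j] = Double.parseDouble(tok[j]);
    for (int j = 0; j < P; j++)            // add row' * row
      for (int k = 0; k < P; k++)
        partial[j][k] += row[j] * row[k];
  }

  public void close() throws IOException {
    if (out == null) return;               // this mapper saw no input
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < P; j++)
      for (int k = 0; k < P; k++)
        sb.append(partial[j][k]).append(' ');
    // One big record per mapper; a single reducer just parses these
    // dense blocks and adds them elementwise.
    out.collect(new IntWritable(0), new Text(sb.toString().trim()));
  }
}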

On 2/1/08 3:48 PM, "Ted Dunning" <td...@veoh.com> wrote:

> 
> In my work, I am finding that sending around entire rows or columns of the
> adjacency matrix gives substantial gains, as does block decomposition of some
> of the algorithms involved.
> 
> 
> On 2/1/08 2:51 PM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:
> 
>> Some of our biggest map-reduce jobs have been graph-related (not shortest
>> path though).
>>
>> Map-reduce doesn't seem like the best computation platform for some of the
>> jobs we have had to do. Even a huge graph does not require unheard-of
>> amounts of memory to store as an adjacency list, but mapping (at least
>> some) graph algorithms to map-reduce causes intermediate data to bloat to
>> enormous sizes.
>>
>> To that end we are moving away from pure map-reduce to hybrid models that
>> work in tandem with large memory banks caching the graph. The trend towards
>> cheap flash storage is a helpful factor (one that we haven't exploited yet
>> though).
>> 
>> 
>> 
>> -----Original Message-----
>> From: Peter W. [mailto:peter@marketingbrokers.com]
>> Sent: Fri 2/1/2008 1:14 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: graph data representation for mapreduce
>>  
>> Cam,
>> 
>> Making a directed graph in Hadoop is not
>> very difficult, but traversing it live might
>> be, since the result of each pass is a
>> separate output file.
>>
>> Basically, you emit a destination node as
>> your key in the mapper and the from-nodes as
>> intermediate values. Concatenate the from-values
>> in the reducer, assigning a weight to each edge.
>>
>> Assigned edge scores come either from a
>> computation done in the reducer or from a
>> number passed along with the key.
>>
>> This gives a simple but weighted from/to
>> depiction that can be experimented with and
>> improved by subsequent passes, or by REST-style
>> calls in the mapper to pull weights from a
>> MySQL database.
>> 
>> Later,
>> 
>> Peter W.
>> 
>> Cam Bazz wrote:
>> 
>>> Hello,
>>> 
>>> I have long been interested in storing graphs in databases, object
>>> databases, and Lucene-like indexes.
>>> ....
>>>
>>> Has anyone done any work on storing and processing graphs with map
>>> reduce?
>>> If I were to start, where would I start? I am interested in finding
>>> shortest paths in a large graph.
>> 
>> 


Re: graph data representation for mapreduce

Posted by Ted Dunning <td...@veoh.com>.
In my work, I am finding that sending around entire rows or columns of the
adjacency matrix gives substantial gains, as does block decomposition of some
of the algorithms involved.
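
One way to read the block-decomposition point (my own sketch, with a
hypothetical block size B): key each edge by the sub-matrix it falls in, so
the shuffle delivers whole blocks to a reducer rather than scattered
elements.

// Hypothetical block keying: route edge (u, v) to the reduce group
// that owns the B x B block of the adjacency matrix containing it,
// so each reducer can assemble and process one whole sub-matrix.
static final int B = 10000;              // hypothetical block size

static String blockKey(long u, long v) {
  return (u / B) + "," + (v / B);        // (blockRow, blockCol)
}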


On 2/1/08 2:51 PM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> Some of our biggest map-reduce jobs have been graph-related (not shortest path
> though).
>
> Map-reduce doesn't seem like the best computation platform for some of the
> jobs we have had to do. Even a huge graph does not require unheard-of amounts
> of memory to store as an adjacency list, but mapping (at least some) graph
> algorithms to map-reduce causes intermediate data to bloat to enormous sizes.
>
> To that end we are moving away from pure map-reduce to hybrid models that work
> in tandem with large memory banks caching the graph. The trend towards cheap
> flash storage is a helpful factor (one that we haven't exploited yet though).
> 
> 
> 
> -----Original Message-----
> From: Peter W. [mailto:peter@marketingbrokers.com]
> Sent: Fri 2/1/2008 1:14 PM
> To: core-user@hadoop.apache.org
> Subject: Re: graph data representation for mapreduce
>  
> Cam,
> 
> Making a directed graph in Hadoop is not
> very difficult, but traversing it live might
> be, since the result of each pass is a
> separate output file.
>
> Basically, you emit a destination node as
> your key in the mapper and the from-nodes as
> intermediate values. Concatenate the from-values
> in the reducer, assigning a weight to each edge.
>
> Assigned edge scores come either from a
> computation done in the reducer or from a
> number passed along with the key.
>
> This gives a simple but weighted from/to
> depiction that can be experimented with and
> improved by subsequent passes, or by REST-style
> calls in the mapper to pull weights from a
> MySQL database.
> 
> Later,
> 
> Peter W.
> 
> Cam Bazz wrote:
> 
>> Hello,
>> 
>> I have long been interested in storing graphs in databases, object
>> databases, and Lucene-like indexes.
>> ....
>>
>> Has anyone done any work on storing and processing graphs with map
>> reduce?
>> If I were to start, where would I start? I am interested in finding
>> shortest paths in a large graph.
> 
> 


RE: graph data representation for mapreduce

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Some of our biggest map-reduce jobs have been graph-related (not shortest path though).

Map-reduce doesn't seem like the best computation platform for some of the jobs we have had to do. Even a huge graph does not require unheard-of amounts of memory to store as an adjacency list, but mapping (at least some) graph algorithms to map-reduce causes intermediate data to bloat to enormous sizes.

To that end we are moving away from pure map-reduce to hybrid models that work in tandem with large memory banks caching the graph. The trend towards cheap flash storage is a helpful factor (one that we haven't exploited yet though).
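
To make the memory claim concrete (rough numbers of my own, not from the
original post): a graph with 10^9 edges stored as adjacency lists of 4-byte
node ids is on the order of 10^9 * 4 bytes = 4 GB, which fits in the RAM of
a single large machine.  The bloat appears in the intermediate data instead:
an iterative job that re-emits the adjacency data on every pass shuffles and
writes that same 4 GB per iteration, so a 20-pass computation moves roughly
80 GB through the shuffle.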



-----Original Message-----
From: Peter W. [mailto:peter@marketingbrokers.com]
Sent: Fri 2/1/2008 1:14 PM
To: core-user@hadoop.apache.org
Subject: Re: graph data representation for mapreduce
 
Cam,

Making a directed graph in Hadoop is not
very difficult, but traversing it live might
be, since the result of each pass is a
separate output file.

Basically, you emit a destination node as
your key in the mapper and the from-nodes as
intermediate values. Concatenate the from-values
in the reducer, assigning a weight to each edge.

Assigned edge scores come either from a
computation done in the reducer or from a
number passed along with the key.

This gives a simple but weighted from/to
depiction that can be experimented with and
improved by subsequent passes, or by REST-style
calls in the mapper to pull weights from a
MySQL database.

Later,

Peter W.

Cam Bazz wrote:

> Hello,
>
> I have long been interested in storing graphs in databases, object
> databases, and Lucene-like indexes.
> ....
>
> Has anyone done any work on storing and processing graphs with map
> reduce?
> If I were to start, where would I start? I am interested in finding
> shortest paths in a large graph.



Re: graph data representation for mapreduce

Posted by Ted Dunning <td...@veoh.com>.
I can't say anything with any specificity, but we store and manipulate
large graphs in Hadoop.


On 2/1/08 1:14 PM, "Peter W." <pe...@marketingbrokers.com> wrote:

> Cam,
> 
> Making a directed graph in Hadoop is not
> very difficult, but traversing it live might
> be, since the result of each pass is a
> separate output file.
>
> Basically, you emit a destination node as
> your key in the mapper and the from-nodes as
> intermediate values. Concatenate the from-values
> in the reducer, assigning a weight to each edge.
>
> Assigned edge scores come either from a
> computation done in the reducer or from a
> number passed along with the key.
>
> This gives a simple but weighted from/to
> depiction that can be experimented with and
> improved by subsequent passes, or by REST-style
> calls in the mapper to pull weights from a
> MySQL database.
> 
> Later,
> 
> Peter W.
> 
> Cam Bazz wrote:
> 
>> Hello,
>> 
>> I have long been interested in storing graphs in databases, object
>> databases, and Lucene-like indexes.
>> ....
>>
>> Has anyone done any work on storing and processing graphs with map
>> reduce?
>> If I were to start, where would I start? I am interested in finding
>> shortest paths in a large graph.
>