Posted to common-user@hadoop.apache.org by udiw <ud...@gmail.com> on 2009/08/21 11:40:21 UTC

Job design question

Hi all,
I'm trying to design an MR job for processing walks-on-a-graph data from a
database. The idea is that I have a list of random walks on a graph (the
graph itself is unknown).

I have two tables ("walk ids" and "hops"):
- the first holds the list of random-walk ids, one row per walk, each with a
unique, increasing id.
- the second holds, for each walk (identified by that uid), the list of hops
(vertices) traversed in the walk, one hop per row.
-- these two tables are in a one-to-many structure, with the walk uid used
as the foreign key in the hops table (see the sketch just below for the
hops-row class I have in mind).
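
For concreteness, here is roughly the row class I imagine feeding to
DBInputFormat for the hops table (the column names walk_id, seq and
vertex_id are made up -- substitute the actual schema):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One row of the hops table: (walk_id, seq, vertex_id).
public class HopRecord implements Writable, DBWritable {
  public long walkId;   // foreign key into the walk-ids table
  public int seq;       // position of this hop within its walk
  public long vertexId; // vertex visited at this hop

  // Called by DBInputFormat for each row of the result set.
  public void readFields(ResultSet rs) throws SQLException {
    walkId = rs.getLong("walk_id");
    seq = rs.getInt("seq");
    vertexId = rs.getLong("vertex_id");
  }

  // Only needed when writing back via DBOutputFormat; here for completeness.
  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, walkId);
    ps.setInt(2, seq);
    ps.setLong(3, vertexId);
  }

  // Hadoop serialization, used when the record crosses the shuffle.
  public void readFields(DataInput in) throws IOException {
    walkId = in.readLong();
    seq = in.readInt();
    vertexId = in.readLong();
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(walkId);
    out.writeInt(seq);
    out.writeLong(vertexId);
  }
}

The job would then be wired up with something like
DBInputFormat.setInput(job, HopRecord.class, "hops", null, "walk_id",
"walk_id", "seq", "vertex_id") -- again assuming those column names.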

Meaning, each walk needs to be processed as a unit: walks can be split
between nodes, but the hops belonging to a single walk must all end up
together.
How would you suggest handling this structure? Is it even possible with
DBInputFormat?
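
One idea I had, in case DBInputFormat's row-range splits can cut a walk in
two: have the mapper re-key every hop by its walk id, so the shuffle
regroups all hops of a walk at a single reduce call regardless of which
split they came from. Something like this (using the HopRecord class
above):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Re-key each hop by its walk id; the shuffle then delivers all hops
// of one walk to the same reduce() call, whatever split they were in.
public class HopsByWalkMapper
    extends Mapper<LongWritable, HopRecord, LongWritable, HopRecord> {

  private final LongWritable walkKey = new LongWritable();

  protected void map(LongWritable row, HopRecord hop, Context context)
      throws IOException, InterruptedException {
    walkKey.set(hop.walkId);
    context.write(walkKey, hop);
  }
}

That would move the "must not split a walk" constraint from the input
splits to the reduce side -- but maybe there's a better way?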

Second, assuming such a split is possible in an MR job, I would like
different reducers to operate on the data in a single reading pass (I want
to avoid reading the input multiple times, since that can take a long
time).
For example, one Reducer should create the actual graph: (Source Node, Dest
Node) --> (num_walks).
Another one should create a length analysis: (Origin Node, Final Node) -->
(distance).
etc.
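
To make that concrete, I was imagining one reducer that writes both
results in a single pass using MultipleOutputs (if I understand that class
correctly); the "edges" and "lengths" names are just my invention, and the
edge output would still need a small follow-up job to sum the counts per
edge into num_walks:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// One pass over the hops of each walk, two named outputs:
//   "edges":   src:dst -> 1        (summed per edge by a follow-up job)
//   "lengths": origin:final -> number of hops in the walk
public class WalkAnalysisReducer
    extends Reducer<LongWritable, HopRecord, Text, IntWritable> {

  private MultipleOutputs<Text, IntWritable> mos;
  private final IntWritable one = new IntWritable(1);

  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, IntWritable>(context);
  }

  protected void reduce(LongWritable walkId, Iterable<HopRecord> hops,
      Context context) throws IOException, InterruptedException {
    // Hadoop reuses the value object, so copy each hop before sorting
    // the walk into traversal order by its seq column.
    List<HopRecord> walk = new ArrayList<HopRecord>();
    for (HopRecord h : hops) {
      HopRecord c = new HopRecord();
      c.walkId = h.walkId; c.seq = h.seq; c.vertexId = h.vertexId;
      walk.add(c);
    }
    Collections.sort(walk, new Comparator<HopRecord>() {
      public int compare(HopRecord a, HopRecord b) { return a.seq - b.seq; }
    });

    // Graph output: one record per traversed edge.
    for (int i = 1; i < walk.size(); i++) {
      mos.write("edges",
          new Text(walk.get(i - 1).vertexId + ":" + walk.get(i).vertexId),
          one);
    }

    // Length output: (origin, final) -> distance in hops.
    if (!walk.isEmpty()) {
      mos.write("lengths",
          new Text(walk.get(0).vertexId + ":"
              + walk.get(walk.size() - 1).vertexId),
          new IntWritable(walk.size() - 1));
    }
  }

  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}

The driver would declare the outputs with something like
MultipleOutputs.addNamedOutput(job, "edges", TextOutputFormat.class,
Text.class, IntWritable.class), and the same for "lengths". Does that
sound like a reasonable way to avoid re-reading the database?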

Any comments and thoughts will help!
Thanks.