You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2010/12/21 03:22:37 UTC
Re: breadth-first search
Check this slide out -
http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <We...@xerox.com> wrote:
>
> I implemented an algorithm to run hadoop on a 25GB graph data to
> calculate its average separation length.
> The input format is V1(tab)V2 (where V2 is the friend of V1).
> My purpose is to first randomly select some seed nodes, and then for
> each node, calculate the shortest paths from this node to all other
> nodes on the graph.
>
> To do this, I first run a simple python code in a single machine to get
> some random seed nodes.
> Then I run a hadoop job to generate adjacent list for each node as the
> input for the second job.
>
> The second job takes the adjacent list input and output the first level
> breadth-first search result. The nodes which are the friends of the seed
> node have distance 1. Then this output is the input for the next hadoop
> job so on so forth, until all the nodes are reached.
>
> I generated a simulated graph for testing. This data has only 100 nodes.
> Normal python code can find the separation length within 1 second (100
> seed nodes). However, the hadoop took almost 3 hours to do that
> (pseudo-distributed mode on one machine)!!
>
> I wonder if there is a more efficient way to do breadth-first search in
> hadoop? It is very inefficient to output so many intermediate results.
> Totally there would be seedNodeNumber*levelNumber+1 jobs,
> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
>
> Please help. Thanks!
>
> Wei
>
--
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org
Re: breadth-first search
Posted by Ted Dunning <td...@maprtech.com>.
Ahh... I see what you mean.
This algorithm can be implemented with all of the iterations for all points
proceeding in parallel. You should only need 4 map-reduce steps, not 400.
This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
2010/12/21 Peng, Wei <We...@xerox.com>
> The graph that my BFS algorithm is running on only needs 4 levels to reach
> all nodes. The reason I say "many iterations" is that there are 100 sources
> nodes, so totally 400 iterations. The algorithm should be right, and I
> cannot think of anything to reduce the number of iterations.
>
> Ted, I will check out the links that you sent to me.
> I really appreciate your help.
>
> Wei
> -----Original Message-----
> From: Edward J. Yoon [mailto:edward@udanax.org]
> Sent: Tuesday, December 21, 2010 1:27 AM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
>
> There's no release yet.
>
> But, I had tested the BFS using hama and, hbase.
>
> Sent from my iPhone
>
> On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <We...@xerox.com> wrote:
>
> > Yoon,
> >
> > Can I use HAMA now, or it is still in development?
> >
> > Thanks
> >
> > Wei
> >
> > -----Original Message-----
> > From: Edward J. Yoon [mailto:edwardyoon@apache.org]
> > Sent: Monday, December 20, 2010 6:23 PM
> > To: common-user@hadoop.apache.org
> > Subject: Re: breadth-first search
> >
> > Check this slide out -
> > http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
> >
> > On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <We...@xerox.com> wrote:
> >>
> >> I implemented an algorithm to run hadoop on a 25GB graph data to
> >> calculate its average separation length.
> >> The input format is V1(tab)V2 (where V2 is the friend of V1).
> >> My purpose is to first randomly select some seed nodes, and then for
> >> each node, calculate the shortest paths from this node to all other
> >> nodes on the graph.
> >>
> >> To do this, I first run a simple python code in a single machine to get
> >> some random seed nodes.
> >> Then I run a hadoop job to generate adjacent list for each node as the
> >> input for the second job.
> >>
> >> The second job takes the adjacent list input and output the first level
> >> breadth-first search result. The nodes which are the friends of the seed
> >> node have distance 1. Then this output is the input for the next hadoop
> >> job so on so forth, until all the nodes are reached.
> >>
> >> I generated a simulated graph for testing. This data has only 100 nodes.
> >> Normal python code can find the separation length within 1 second (100
> >> seed nodes). However, the hadoop took almost 3 hours to do that
> >> (pseudo-distributed mode on one machine)!!
> >>
> >> I wonder if there is a more efficient way to do breadth-first search in
> >> hadoop? It is very inefficient to output so many intermediate results.
> >> Totally there would be seedNodeNumber*levelNumber+1 jobs,
> >> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
> >>
> >> Please help. Thanks!
> >>
> >> Wei
> >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > edwardyoon@apache.org
> > http://blog.udanax.org
>
RE: breadth-first search
Posted by "Peng, Wei" <We...@xerox.com>.
The graph that my BFS algorithm is running on only needs 4 levels to reach all nodes. The reason I say "many iterations" is that there are 100 sources nodes, so totally 400 iterations. The algorithm should be right, and I cannot think of anything to reduce the number of iterations.
Ted, I will check out the links that you sent to me.
I really appreciate your help.
Wei
-----Original Message-----
From: Edward J. Yoon [mailto:edward@udanax.org]
Sent: Tuesday, December 21, 2010 1:27 AM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search
There's no release yet.
But, I had tested the BFS using hama and, hbase.
Sent from my iPhone
On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <We...@xerox.com> wrote:
> Yoon,
>
> Can I use HAMA now, or it is still in development?
>
> Thanks
>
> Wei
>
> -----Original Message-----
> From: Edward J. Yoon [mailto:edwardyoon@apache.org]
> Sent: Monday, December 20, 2010 6:23 PM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
>
> Check this slide out -
> http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
>
> On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <We...@xerox.com> wrote:
>>
>> I implemented an algorithm to run hadoop on a 25GB graph data to
>> calculate its average separation length.
>> The input format is V1(tab)V2 (where V2 is the friend of V1).
>> My purpose is to first randomly select some seed nodes, and then for
>> each node, calculate the shortest paths from this node to all other
>> nodes on the graph.
>>
>> To do this, I first run a simple python code in a single machine to get
>> some random seed nodes.
>> Then I run a hadoop job to generate adjacent list for each node as the
>> input for the second job.
>>
>> The second job takes the adjacent list input and output the first level
>> breadth-first search result. The nodes which are the friends of the seed
>> node have distance 1. Then this output is the input for the next hadoop
>> job so on so forth, until all the nodes are reached.
>>
>> I generated a simulated graph for testing. This data has only 100 nodes.
>> Normal python code can find the separation length within 1 second (100
>> seed nodes). However, the hadoop took almost 3 hours to do that
>> (pseudo-distributed mode on one machine)!!
>>
>> I wonder if there is a more efficient way to do breadth-first search in
>> hadoop? It is very inefficient to output so many intermediate results.
>> Totally there would be seedNodeNumber*levelNumber+1 jobs,
>> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
>>
>> Please help. Thanks!
>>
>> Wei
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
Re: breadth-first search
Posted by "Edward J. Yoon" <ed...@udanax.org>.
There's no release yet.
But, I had tested the BFS using hama and, hbase.
Sent from my iPhone
On 2010. 12. 21., at 오전 11:30, "Peng, Wei" <We...@xerox.com> wrote:
> Yoon,
>
> Can I use HAMA now, or it is still in development?
>
> Thanks
>
> Wei
>
> -----Original Message-----
> From: Edward J. Yoon [mailto:edwardyoon@apache.org]
> Sent: Monday, December 20, 2010 6:23 PM
> To: common-user@hadoop.apache.org
> Subject: Re: breadth-first search
>
> Check this slide out -
> http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
>
> On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <We...@xerox.com> wrote:
>>
>> I implemented an algorithm to run hadoop on a 25GB graph data to
>> calculate its average separation length.
>> The input format is V1(tab)V2 (where V2 is the friend of V1).
>> My purpose is to first randomly select some seed nodes, and then for
>> each node, calculate the shortest paths from this node to all other
>> nodes on the graph.
>>
>> To do this, I first run a simple python code in a single machine to get
>> some random seed nodes.
>> Then I run a hadoop job to generate adjacent list for each node as the
>> input for the second job.
>>
>> The second job takes the adjacent list input and output the first level
>> breadth-first search result. The nodes which are the friends of the seed
>> node have distance 1. Then this output is the input for the next hadoop
>> job so on so forth, until all the nodes are reached.
>>
>> I generated a simulated graph for testing. This data has only 100 nodes.
>> Normal python code can find the separation length within 1 second (100
>> seed nodes). However, the hadoop took almost 3 hours to do that
>> (pseudo-distributed mode on one machine)!!
>>
>> I wonder if there is a more efficient way to do breadth-first search in
>> hadoop? It is very inefficient to output so many intermediate results.
>> Totally there would be seedNodeNumber*levelNumber+1 jobs,
>> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
>>
>> Please help. Thanks!
>>
>> Wei
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
RE: breadth-first search
Posted by "Peng, Wei" <We...@xerox.com>.
Yoon,
Can I use HAMA now, or it is still in development?
Thanks
Wei
-----Original Message-----
From: Edward J. Yoon [mailto:edwardyoon@apache.org]
Sent: Monday, December 20, 2010 6:23 PM
To: common-user@hadoop.apache.org
Subject: Re: breadth-first search
Check this slide out -
http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei <We...@xerox.com> wrote:
>
> I implemented an algorithm to run hadoop on a 25GB graph data to
> calculate its average separation length.
> The input format is V1(tab)V2 (where V2 is the friend of V1).
> My purpose is to first randomly select some seed nodes, and then for
> each node, calculate the shortest paths from this node to all other
> nodes on the graph.
>
> To do this, I first run a simple python code in a single machine to get
> some random seed nodes.
> Then I run a hadoop job to generate adjacent list for each node as the
> input for the second job.
>
> The second job takes the adjacent list input and output the first level
> breadth-first search result. The nodes which are the friends of the seed
> node have distance 1. Then this output is the input for the next hadoop
> job so on so forth, until all the nodes are reached.
>
> I generated a simulated graph for testing. This data has only 100 nodes.
> Normal python code can find the separation length within 1 second (100
> seed nodes). However, the hadoop took almost 3 hours to do that
> (pseudo-distributed mode on one machine)!!
>
> I wonder if there is a more efficient way to do breadth-first search in
> hadoop? It is very inefficient to output so many intermediate results.
> Totally there would be seedNodeNumber*levelNumber+1 jobs,
> seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
>
> Please help. Thanks!
>
> Wei
>
--
Best Regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org