You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@giraph.apache.org by Benjamin Heitmann <be...@deri.org> on 2012/07/09 14:44:58 UTC

Re: Suggestions on problem sizes for giraph performance benchmarking

Hello Stephen, 

sorry for the very late reply. 

On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX) wrote:

> Hello Avery and all:
> I have a cluster of 10  two-processor/48 GB RAM servers, upon which we are conducting Hadoop performance characterization tests.  I plan to use the Giraph pagerank and simple shortest path example tests as part of this exercise and would appreciate guidance on problem sizes for both tests.  I’m looking at paring down an obfuscated Twitter dataset and it would save a lot of time if someone has some knowledge on roughly how the time and memory scales with number of nodes in a graph.
> 


I can provide some suggestions for the kind of algorithm and data which does currently surpass the scalability of giraph. 

While the limits to my knowledge of Giraph and Hadoop are probably also to blame for this, please see the recent discussions on this list, 
and on JIRA for other indications that the scalability of Giraph needs improvement: 
* post  by Yuanyuan Tian in the thread "wierd communication errors" on user@giraph.apache.org
* GIRAPH-234 about GC overhead https://issues.apache.org/jira/browse/GIRAPH-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

If you want to stretch the limits of Giraph, then you need to try an algorithm which is conceptually different from PageRank, and you need a big data set. 
If you use an algorithm which has complex application logic (maybe even domain specific logic), which needs to be embedded in the algorithm, 
then the nodes need to have a lot of state. In addition, such algorithms probably send around a lot of messages, and each of the messages might have a payload
which is more complex then one floating point number. In addition, it helps to have a graph format, which requires strings on the edges and vertices. 
The strings are required for the domain specific business logic which the graph algorithm needs to follow. 

Finally, imagine a data set which has a big loading time, and where one run of the algorithm only provides results for one user.
The standard Hadoop paradigm is to throw away the graph after loading it. 
So if you have 100s or 1000s of users, then you need a way to execute the algorithm multiple times in parallel. 
Again this will add a lot of state, as each of the vertices will need to hold one state object for each user who has visited the vertex.

In my specific case, I had the following data and algorithm:
Data:  
* an RDF graph with 10 million vertices and 40 million edges 
I used my own import code to map the RDF graph to a undirected graph with a limit of one edge between any two nodes (so it was not a multi-graph)
* each vertex and each edge uses a string as an identity to represent a URI in the RDF graph (required for the business logic in the algorithm)

Algorithm: 
* spreading activation. 
You can think of it as depth first search guided by domain specific logic. 
A short introduction here: https://en.wikipedia.org/wiki/Spreading_activation
The wikipedia article only mentions using spreading activation on weighted graphs, however I used it on graphs which have additional types on the edges. 
The whole area of using the semantics of the edges to guide the algorithm is an active research topic, so thats why I can't point you to a good article on that. 
* parallel execution: 
I need to run the algorithm once for every user in the system, however loading the data set takes around 15 minutes alone. 
So each node has an array of states, one for each user for which the algorithm has visited a node. 
I experimented with user numbers between 30 and 1000, anything more did not work for concurrent execution of the algorithm. 

Infrastructure: 
* a single server with 24 Intel Xeon 2.4 GHz cpus and 96 GB of RAM
* Hadoop 1.0, pseudo-distributed setup
* between 10 and 20 Giraph workers


A few weeks ago I stopped work on my Giraph based implementation, as Giraph ran out of memory almost immediately after loading and initialising the data. 
I made sure that the Giraph workers do not run out of memory, so it was probably due to IPC and messaging. 
The general discussion on the Giraph mailing list strongly indicates that I did hit the current IPC scalability limits.

Currently I am working on a non-Hadoop version of the algorithm which is not as scalable but which is fast for *one* user. ( less then a second per user, but single threaded). 
In addition, this new version allows me to better integrate with an existing ecosystem of technologies (Semantic Web technologies) to which Hadoop and Giraph is currently completely disconnected. 
However, I will probably revisit Giraph at some time on the future. 


If you want to look at the code or the data or any other asset which I have, then I will gladly share that with you. 
I would really like Giraph to reach the maturity required for this kind of algorithm. 
However, I have the feeling that the current development focus is on clear-cut numerical algorithms such as pagerank. 


Re: Suggestions on problem sizes for giraph performance benchmarking

Posted by Amani Alonazi <am...@kaust.edu.sa>.
Hi Benjamin,

I'm really interesting in this kind of algorithm, if you don't mind to
share it with me. I'm working in Giraph and other pregel - clone system to
demonstrate some graph algorithms such as connected components (strong and
week), max clique, Eulerian Path and others. It would be helpful if you
share with me the code.

Thank you,

On Mon, Jul 9, 2012 at 3:44 PM, Benjamin Heitmann <
benjamin.heitmann@deri.org> wrote:

>
> Hello Stephen,
>
> sorry for the very late reply.
>
> On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX) wrote:
>
>  Hello Avery and all:****
>
> I have a cluster of 10  two-processor/48 GB RAM servers, upon which we are
> conducting Hadoop performance characterization tests.  I plan to use the
> Giraph pagerank and simple shortest path example tests as part of this
> exercise and would appreciate guidance on problem sizes for both tests.
> I’m looking at paring down an obfuscated Twitter dataset and it would save
> a lot of time if someone has some knowledge on roughly how the time and
> memory scales with number of nodes in a graph.
>
> ****
>
>
>
> I can provide some suggestions for the kind of algorithm and data which
> does currently surpass the scalability of giraph.
>
> While the limits to my knowledge of Giraph and Hadoop are probably also to
> blame for this, please see the recent discussions on this list,
> and on JIRA for other indications that the scalability of Giraph needs
> improvement:
> * post  by Yuanyuan Tian in the thread "wierd communication errors" on
> user@giraph.apache.org
> * GIRAPH-234 about GC overhead
> https://issues.apache.org/jira/browse/GIRAPH-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
> If you want to stretch the limits of Giraph, then you need to try an
> algorithm which is conceptually different from PageRank, and you need a big
> data set.
> If you use an algorithm which has complex application logic (maybe even
> domain specific logic), which needs to be embedded in the algorithm,
> then the nodes need to have a lot of state. In addition, such algorithms
> probably send around a lot of messages, and each of the messages might have
> a payload
> which is more complex then one floating point number. In addition, it
> helps to have a graph format, which requires strings on the edges and
> vertices.
> The strings are required for the domain specific business logic which the
> graph algorithm needs to follow.
>
> Finally, imagine a data set which has a big loading time, and where one
> run of the algorithm only provides results for one user.
> The standard Hadoop paradigm is to throw away the graph after loading it.
> So if you have 100s or 1000s of users, then you need a way to execute the
> algorithm multiple times in parallel.
> Again this will add a lot of state, as each of the vertices will need to
> hold one state object for each user who has visited the vertex.
>
> In my specific case, I had the following data and algorithm:
> Data:
> * an RDF graph with 10 million vertices and 40 million edges
> I used my own import code to map the RDF graph to a undirected graph with
> a limit of one edge between any two nodes (so it was not a multi-graph)
> * each vertex and each edge uses a string as an identity to represent a
> URI in the RDF graph (required for the business logic in the algorithm)
>
> Algorithm:
> * spreading activation.
> You can think of it as depth first search guided by domain specific logic.
> A short introduction here:
> https://en.wikipedia.org/wiki/Spreading_activation
> The wikipedia article only mentions using spreading activation on weighted
> graphs, however I used it on graphs which have additional types on the
> edges.
> The whole area of using the semantics of the edges to guide the algorithm
> is an active research topic, so thats why I can't point you to a good
> article on that.
> * parallel execution:
> I need to run the algorithm once for every user in the system, however
> loading the data set takes around 15 minutes alone.
> So each node has an array of states, one for each user for which the
> algorithm has visited a node.
> I experimented with user numbers between 30 and 1000, anything more did
> not work for concurrent execution of the algorithm.
>
> Infrastructure:
> * a single server with 24 Intel Xeon 2.4 GHz cpus and 96 GB of RAM
> * Hadoop 1.0, pseudo-distributed setup
> * between 10 and 20 Giraph workers
>
>
> A few weeks ago I stopped work on my Giraph based implementation, as
> Giraph ran out of memory almost immediately after loading and initialising
> the data.
> I made sure that the Giraph workers do not run out of memory, so it was
> probably due to IPC and messaging.
> The general discussion on the Giraph mailing list strongly indicates that
> I did hit the current IPC scalability limits.
>
> Currently I am working on a non-Hadoop version of the algorithm which is
> not as scalable but which is fast for *one* user. ( less then a second per
> user, but single threaded).
> In addition, this new version allows me to better integrate with an
> existing ecosystem of technologies (Semantic Web technologies) to which
> Hadoop and Giraph is currently completely disconnected.
> However, I will probably revisit Giraph at some time on the future.
>
>
> If you want to look at the code or the data or any other asset which I
> have, then I will gladly share that with you.
> I would really like Giraph to reach the maturity required for this kind of
> algorithm.
> However, I have the feeling that the current development focus is on
> clear-cut numerical algorithms such as pagerank.
>
>


-- 
Amani AlOnazi
MSc Computer Science
King Abdullah University of Science and Technology
Kingdom of Saudi Arabia
amani.alonazi@kaust.edu.sa | +966 (0) 555 191 795

-- 

------------------------------
This message and its contents, including attachments are intended solely 
for the original recipient. If you are not the intended recipient or have 
received this message in error, please notify me immediately and delete 
this message from your computer system. Any unauthorized use or 
distribution is prohibited. Please consider the environment before printing 
this email.

Re: Suggestions on problem sizes for giraph performance benchmarking

Posted by Avery Ching <ac...@apache.org>.
You should try using the appropriate memory settings (i.e.  
-Dmapred.child.java.opts="-Xms30g -Xmx30g -Xss128k") for a 30 GB heap.  
This depends on how much memory you can get.

Avery

On 7/9/12 5:57 AM, Amani Alonazi wrote:
> Actually, I had the same problem of running out of memory with Giraph 
> when trying to implement strongly connected components algorithm on 
> Giraph. My input graph is 1 million nodes and 7 million edges.
>
> I'm using cluster of 21 computers.
>
>
> On Mon, Jul 9, 2012 at 3:44 PM, Benjamin Heitmann 
> <benjamin.heitmann@deri.org <ma...@deri.org>> wrote:
>
>
>     Hello Stephen,
>
>     sorry for the very late reply.
>
>     On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX)
>     wrote:
>
>>     Hello Avery and all:
>>
>>     I have a cluster of 10  two-processor/48 GB RAM servers, upon
>>     which we are conducting Hadoop performance characterization
>>     tests.  I plan to use the Giraph pagerank and simple shortest
>>     path example tests as part of this exercise and would appreciate
>>     guidance on problem sizes for both tests.  I’m looking at paring
>>     down an obfuscated Twitter dataset and it would save a lot of
>>     time if someone has some knowledge on roughly how the time and
>>     memory scales with number of nodes in a graph.
>>
>
>
>     I can provide some suggestions for the kind of algorithm and data
>     which does currently surpass the scalability of giraph.
>
>     While the limits to my knowledge of Giraph and Hadoop are probably
>     also to blame for this, please see the recent discussions on this
>     list,
>     and on JIRA for other indications that the scalability of Giraph
>     needs improvement:
>     * post  by Yuanyuan Tian in the thread "wierd communication
>     errors" on user@giraph.apache.org <ma...@giraph.apache.org>
>     * GIRAPH-234 about GC overhead
>     https://issues.apache.org/jira/browse/GIRAPH-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
>     If you want to stretch the limits of Giraph, then you need to try
>     an algorithm which is conceptually different from PageRank, and
>     you need a big data set.
>     If you use an algorithm which has complex application logic (maybe
>     even domain specific logic), which needs to be embedded in the
>     algorithm,
>     then the nodes need to have a lot of state. In addition, such
>     algorithms probably send around a lot of messages, and each of the
>     messages might have a payload
>     which is more complex then one floating point number. In addition,
>     it helps to have a graph format, which requires strings on the
>     edges and vertices.
>     The strings are required for the domain specific business logic
>     which the graph algorithm needs to follow.
>
>     Finally, imagine a data set which has a big loading time, and
>     where one run of the algorithm only provides results for one user.
>     The standard Hadoop paradigm is to throw away the graph after
>     loading it.
>     So if you have 100s or 1000s of users, then you need a way to
>     execute the algorithm multiple times in parallel.
>     Again this will add a lot of state, as each of the vertices will
>     need to hold one state object for each user who has visited the
>     vertex.
>
>     In my specific case, I had the following data and algorithm:
>     Data:
>     * an RDF graph with 10 million vertices and 40 million edges
>     I used my own import code to map the RDF graph to a undirected
>     graph with a limit of one edge between any two nodes (so it was
>     not a multi-graph)
>     * each vertex and each edge uses a string as an identity to
>     represent a URI in the RDF graph (required for the business logic
>     in the algorithm)
>
>     Algorithm:
>     * spreading activation.
>     You can think of it as depth first search guided by domain
>     specific logic.
>     A short introduction here:
>     https://en.wikipedia.org/wiki/Spreading_activation
>     The wikipedia article only mentions using spreading activation on
>     weighted graphs, however I used it on graphs which have additional
>     types on the edges.
>     The whole area of using the semantics of the edges to guide the
>     algorithm is an active research topic, so thats why I can't point
>     you to a good article on that.
>     * parallel execution:
>     I need to run the algorithm once for every user in the system,
>     however loading the data set takes around 15 minutes alone.
>     So each node has an array of states, one for each user for which
>     the algorithm has visited a node.
>     I experimented with user numbers between 30 and 1000, anything
>     more did not work for concurrent execution of the algorithm.
>
>     Infrastructure:
>     * a single server with 24 Intel Xeon 2.4 GHz cpus and 96 GB of RAM
>     * Hadoop 1.0, pseudo-distributed setup
>     * between 10 and 20 Giraph workers
>
>
>     A few weeks ago I stopped work on my Giraph based implementation,
>     as Giraph ran out of memory almost immediately after loading and
>     initialising the data.
>     I made sure that the Giraph workers do not run out of memory, so
>     it was probably due to IPC and messaging.
>     The general discussion on the Giraph mailing list strongly
>     indicates that I did hit the current IPC scalability limits.
>
>     Currently I am working on a non-Hadoop version of the algorithm
>     which is not as scalable but which is fast for *one* user. ( less
>     then a second per user, but single threaded).
>     In addition, this new version allows me to better integrate with
>     an existing ecosystem of technologies (Semantic Web technologies)
>     to which Hadoop and Giraph is currently completely disconnected.
>     However, I will probably revisit Giraph at some time on the future.
>
>
>     If you want to look at the code or the data or any other asset
>     which I have, then I will gladly share that with you.
>     I would really like Giraph to reach the maturity required for this
>     kind of algorithm.
>     However, I have the feeling that the current development focus is
>     on clear-cut numerical algorithms such as pagerank.
>
>
>
>
> -- 
> Amani AlOnazi
> MSc Computer Science
> King Abdullah University of Science and Technology
> Kingdom of Saudi Arabia
> amani.alonazi@kaust.edu.sa <ma...@kaust.edu.sa> | +966 
> (0) 555 191 795
>
>
> ------------------------------------------------------------------------
> This message and its contents, including attachments are intended 
> solely for the original recipient. If you are not the intended 
> recipient or have received this message in error, please notify me 
> immediately and delete this message from your computer system. Any 
> unauthorized use or distribution is prohibited. Please consider the 
> environment before printing this email. 



Re: Suggestions on problem sizes for giraph performance benchmarking

Posted by Amani Alonazi <am...@kaust.edu.sa>.
Actually, I had the same problem of running out of memory with Giraph when
trying to implement strongly connected components algorithm on Giraph. My
input graph is 1 million nodes and 7 million edges.

I'm using cluster of 21 computers.


On Mon, Jul 9, 2012 at 3:44 PM, Benjamin Heitmann <
benjamin.heitmann@deri.org> wrote:

>
> Hello Stephen,
>
> sorry for the very late reply.
>
> On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX) wrote:
>
>  Hello Avery and all:****
>
> I have a cluster of 10  two-processor/48 GB RAM servers, upon which we are
> conducting Hadoop performance characterization tests.  I plan to use the
> Giraph pagerank and simple shortest path example tests as part of this
> exercise and would appreciate guidance on problem sizes for both tests.
> I’m looking at paring down an obfuscated Twitter dataset and it would save
> a lot of time if someone has some knowledge on roughly how the time and
> memory scales with number of nodes in a graph.
>
> ****
>
>
>
> I can provide some suggestions for the kind of algorithm and data which
> does currently surpass the scalability of giraph.
>
> While the limits to my knowledge of Giraph and Hadoop are probably also to
> blame for this, please see the recent discussions on this list,
> and on JIRA for other indications that the scalability of Giraph needs
> improvement:
> * post  by Yuanyuan Tian in the thread "wierd communication errors" on
> user@giraph.apache.org
> * GIRAPH-234 about GC overhead
> https://issues.apache.org/jira/browse/GIRAPH-234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
> If you want to stretch the limits of Giraph, then you need to try an
> algorithm which is conceptually different from PageRank, and you need a big
> data set.
> If you use an algorithm which has complex application logic (maybe even
> domain specific logic), which needs to be embedded in the algorithm,
> then the nodes need to have a lot of state. In addition, such algorithms
> probably send around a lot of messages, and each of the messages might have
> a payload
> which is more complex then one floating point number. In addition, it
> helps to have a graph format, which requires strings on the edges and
> vertices.
> The strings are required for the domain specific business logic which the
> graph algorithm needs to follow.
>
> Finally, imagine a data set which has a big loading time, and where one
> run of the algorithm only provides results for one user.
> The standard Hadoop paradigm is to throw away the graph after loading it.
> So if you have 100s or 1000s of users, then you need a way to execute the
> algorithm multiple times in parallel.
> Again this will add a lot of state, as each of the vertices will need to
> hold one state object for each user who has visited the vertex.
>
> In my specific case, I had the following data and algorithm:
> Data:
> * an RDF graph with 10 million vertices and 40 million edges
> I used my own import code to map the RDF graph to a undirected graph with
> a limit of one edge between any two nodes (so it was not a multi-graph)
> * each vertex and each edge uses a string as an identity to represent a
> URI in the RDF graph (required for the business logic in the algorithm)
>
> Algorithm:
> * spreading activation.
> You can think of it as depth first search guided by domain specific logic.
> A short introduction here:
> https://en.wikipedia.org/wiki/Spreading_activation
> The wikipedia article only mentions using spreading activation on weighted
> graphs, however I used it on graphs which have additional types on the
> edges.
> The whole area of using the semantics of the edges to guide the algorithm
> is an active research topic, so thats why I can't point you to a good
> article on that.
> * parallel execution:
> I need to run the algorithm once for every user in the system, however
> loading the data set takes around 15 minutes alone.
> So each node has an array of states, one for each user for which the
> algorithm has visited a node.
> I experimented with user numbers between 30 and 1000, anything more did
> not work for concurrent execution of the algorithm.
>
> Infrastructure:
> * a single server with 24 Intel Xeon 2.4 GHz cpus and 96 GB of RAM
> * Hadoop 1.0, pseudo-distributed setup
> * between 10 and 20 Giraph workers
>
>
> A few weeks ago I stopped work on my Giraph based implementation, as
> Giraph ran out of memory almost immediately after loading and initialising
> the data.
> I made sure that the Giraph workers do not run out of memory, so it was
> probably due to IPC and messaging.
> The general discussion on the Giraph mailing list strongly indicates that
> I did hit the current IPC scalability limits.
>
> Currently I am working on a non-Hadoop version of the algorithm which is
> not as scalable but which is fast for *one* user. ( less then a second per
> user, but single threaded).
> In addition, this new version allows me to better integrate with an
> existing ecosystem of technologies (Semantic Web technologies) to which
> Hadoop and Giraph is currently completely disconnected.
> However, I will probably revisit Giraph at some time on the future.
>
>
> If you want to look at the code or the data or any other asset which I
> have, then I will gladly share that with you.
> I would really like Giraph to reach the maturity required for this kind of
> algorithm.
> However, I have the feeling that the current development focus is on
> clear-cut numerical algorithms such as pagerank.
>
>


-- 
Amani AlOnazi
MSc Computer Science
King Abdullah University of Science and Technology
Kingdom of Saudi Arabia
amani.alonazi@kaust.edu.sa | +966 (0) 555 191 795

-- 

------------------------------
This message and its contents, including attachments are intended solely 
for the original recipient. If you are not the intended recipient or have 
received this message in error, please notify me immediately and delete 
this message from your computer system. Any unauthorized use or 
distribution is prohibited. Please consider the environment before printing 
this email.