Posted to user@giraph.apache.org by Baldo Faieta <bf...@adobe.com> on 2012/10/03 03:01:12 UTC

Adsorption on giraph - memory problems

Hi Everyone.

I have implemented the Adsorption algorithm (
http://rio.ecs.umass.edu/~lgao/ece697_10/Paper/random.pdf ,
http://talukdar.net/papers/adsorption_ecml09.pdf )
as it seems well suited to running on Giraph. I'm testing it with the
MovieLens dataset ( http://www.grouplens.org/node/73 ), and when I run it
on a small graph (6k nodes, 200k edges) it runs fine.

But as soon as I try to scale up the graph I run into memory problems. I'm
running with 3 processes and have set mapred.map.child.java.opts fairly
high (2G per process). Looking at the memory allocation in each superstep,
it seems that all the messages are held in memory for the duration of the
superstep before being processed, and the job runs out of memory quickly
when I increase the size of the graph (e.g., 20k nodes, 1M edges).

The algorithm works by sending label distributions along each vertex's
outgoing edges and aggregating the distributions when the messages are
received. I have implemented a combiner for the messages, but it doesn't
seem to help.
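
For reference, the aggregation is just a per-label sum of the incoming
distributions. Here is a simplified sketch of what I mean (my real code uses
custom Writables, so treat the types as placeholders; the same logic sits in
the message combiner and again in compute()):

    import java.util.HashMap;
    import java.util.Map;

    /** Per-label sum of label distributions, as used when combining
     *  messages or aggregating them in compute(). */
    public final class DistributionMerge {

        /** Adds every (label, weight) entry of 'incoming' into 'accumulated'. */
        public static void mergeInto(Map<Integer, Double> accumulated,
                                     Map<Integer, Double> incoming) {
            for (Map.Entry<Integer, Double> e : incoming.entrySet()) {
                Double current = accumulated.get(e.getKey());
                accumulated.put(e.getKey(),
                    current == null ? e.getValue() : current + e.getValue());
            }
        }

        /** Collapses many messages into one accumulated distribution. */
        public static Map<Integer, Double> combine(
                Iterable<Map<Integer, Double>> messages) {
            Map<Integer, Double> result = new HashMap<Integer, Double>();
            for (Map<Integer, Double> message : messages) {
                mergeInto(result, message);
            }
            return result;
        }
    }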

I think the problem is that the messages themselves, being whole
distributions, consume much more memory than in other examples (e.g.,
PageRank), so you need a hefty memory allocation per process to keep all
the messages in memory before they can be processed or even combined. Is
this the case? Is there a way to be more aggressive with the combiner?
Ideally it would be great to spill the messages to disk until they can be
processed, so as not to run into this problem. Does anyone have
suggestions, or do I just have to get servers with much more memory?
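
As a rough back-of-envelope (purely illustrative; the real numbers depend
on how many labels each distribution ends up carrying and on the Writable
encoding): if a message carries a distribution over ~1,000 labels at
roughly 20 bytes per entry as boxed Java objects, each message is on the
order of 20 KB, and the ~1M messages of a superstep add up to something
like 20 GB, well beyond the 3 x 2G I'm giving the map tasks. A PageRank
message, by contrast, is a single 8-byte double.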

BTW, if anyone is interested, I can try to post the implementation. I am
using it to propagate resources to recommend to users, based on the
relations of users to resources and the interrelations among the resources
themselves (e.g., user --viewed--> movie, director --directed--> movie,
movie --is-genre-of--> genre, etc.).

Thanks,

Baldo


Re: Adsorption on giraph - memory problems

Posted by Avery Ching <ac...@apache.org>.
Hi Baldo,

We are using Trove
(http://mvnrepository.com/artifact/net.sf.trove4j/trove4j) to pack the
vertices into a smaller memory footprint.  You are likely to see a nice
benefit as well.  You could also try the out-of-core message and vertex
implementations.
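
For example, storing the label distribution in a Trove primitive map
instead of a HashMap<Integer, Double> avoids the boxed Integer/Double
objects, which is where most of the per-entry overhead goes. A rough
sketch (Trove 3.x; the Writable serialization is left out):

    import gnu.trove.iterator.TIntDoubleIterator;
    import gnu.trove.map.hash.TIntDoubleHashMap;

    /** Label distribution stored as primitive int -> double entries. */
    public class PackedDistribution {
        private final TIntDoubleHashMap weights = new TIntDoubleHashMap();

        /** Adds 'weight' to the current weight of 'label' (inserting if absent). */
        public void add(int label, double weight) {
            weights.adjustOrPutValue(label, weight, weight);
        }

        /** Merges another distribution into this one. */
        public void addAll(PackedDistribution other) {
            for (TIntDoubleIterator it = other.weights.iterator(); it.hasNext(); ) {
                it.advance();
                add(it.key(), it.value());
            }
        }
    }

(If your Giraph build includes the out-of-core message store, it is
switched on with a configuration option, e.g. giraph.useOutOfCoreMessages
in recent trunk; check the options available in your version.)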

Avery
