Posted to user@giraph.apache.org by André Kelpe <ef...@googlemail.com> on 2012/01/23 10:56:10 UTC

giraph stability problem

Hi list,

I have been investigating giraph for a week now and I have a huge stability
problem. I am running the trunk against a CDH3u2 hadoop cluster. The problem
I am trying to solve goes as follows:

In a graph there are 3 kinds of vertices:

1: end-of-the-world vertices, which have exactly 1 connected edge
2: bivalent vertices, which have exactly 2 connected edges
3: multivalent vertices, which have n connected edges (n > 2)

The goal is to calculate, for each vertex that is not in category 2, all
the paths to the other reachable non-bivalent vertices. In other words: all
paths between non-bivalent vertices.

To give an example:

                                        [9]
                                         |
                                         |
                                        <12>
                                         |
                                         |
      [5]-----<13>-----[6]-----<11>-----[7]-----<10>-----[8]

In this graph I want to know that [5] forms a path to [7] via the edges <13>
and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via
<11><13> with [5]. You get the idea... Directionality is not important.

The algorithm I have is pretty straightforward: in superstep 0, all
non-bivalent vertices send a message to their neighbours saying via which
edge they are reachable. In all following supersteps the bivalent vertices
simply forward this information, while the non-bivalent ones terminate the
algorithm. The messages they send are TextWritable instances encoding the
path.
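To make the message flow concrete, here is a minimal plain-Java sketch of
those supersteps. This is not the Giraph API: the class, the message record
and the in-memory inbox loop are all illustrative, and it assumes every
chain of bivalent vertices eventually ends in a non-bivalent vertex.

```java
import java.util.*;

// Plain-Java sketch of the superstep flow described above -- NOT the
// Giraph API; all names here are illustrative.  Undirected graph with
// integer vertex and edge ids, as in the example.
class PathPropagation {

    // adjacency: vertex -> (neighbour -> id of the connecting edge)
    final Map<Integer, Map<Integer, Integer>> adj = new HashMap<>();

    void addEdge(int a, int b, int edge) {
        adj.computeIfAbsent(a, k -> new HashMap<>()).put(b, edge);
        adj.computeIfAbsent(b, k -> new HashMap<>()).put(a, edge);
    }

    boolean bivalent(int v) { return adj.get(v).size() == 2; }

    // one in-flight message: origin vertex, immediate sender, edges walked
    record Msg(int origin, int from, List<Integer> path) {}

    /** Runs the supersteps; returns "origin->target" mapped to the edge-id
     *  path between them (if several paths link a pair, the last one wins). */
    Map<String, List<Integer>> run() {
        Map<Integer, List<Msg>> inbox = new HashMap<>();
        Map<String, List<Integer>> result = new TreeMap<>();

        // superstep 0: every non-bivalent vertex tells each neighbour via
        // which edge it is reachable
        for (int v : adj.keySet()) {
            if (bivalent(v)) continue;
            for (Map.Entry<Integer, Integer> e : adj.get(v).entrySet())
                inbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                     .add(new Msg(v, v, List.of(e.getValue())));
        }

        // later supersteps: bivalent vertices forward, the rest terminate
        while (!inbox.isEmpty()) {
            Map<Integer, List<Msg>> next = new HashMap<>();
            for (Map.Entry<Integer, List<Msg>> entry : inbox.entrySet()) {
                int v = entry.getKey();
                for (Msg m : entry.getValue()) {
                    if (!bivalent(v)) {          // path complete: record it
                        result.put(m.origin() + "->" + v, m.path());
                    } else {                     // forward, but don't bounce back
                        for (Map.Entry<Integer, Integer> e : adj.get(v).entrySet()) {
                            if (e.getKey() == m.from()) continue;
                            List<Integer> p = new ArrayList<>(m.path());
                            p.add(e.getValue());
                            next.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                                .add(new Msg(m.origin(), v, p));
                        }
                    }
                }
            }
            inbox = next;
        }
        return result;
    }
}
```

On the example graph above this reports, among others, that [5] reaches [7]
via <13>,<11> and that [7] reaches [9] via <12>.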

If I run this algorithm on an input with 1 million edges it never finishes:
the master process, and then the others, always go out of memory, even with a
10GB heap. I know that Java programs can be memory hungry, but a 10GB heap
for processing a 40MB input file is a bit too much in my book. I have tried
all sorts of settings, like disabling checkpoints, but nothing makes it
finish. I also see a slowdown in processing: the first 20 or so supersteps
are done in no time, but then processing slows down until it crashes in
superstep 47.

My questions are: What am I doing wrong? Do you have any pointers for things
to look out for? How can I get this thing to finish? Why is it so memory
hungry?

Thanks a lot for your help!

-André

Re: giraph stability problem

Posted by Avery Ching <ac...@apache.org>.
Glad to hear you fixed your problem.  It would be great if you could 
describe any improvements that would have helped you find the issues 
earlier.  Maybe we (or you) could add them =).

Avery

On 1/23/12 8:31 AM, André Kelpe wrote:
> [...]


Re: giraph stability problem

Posted by André Kelpe <ef...@googlemail.com>.
Hi all,

thanks for all the answers so far. It turns out it isn't that much of a
problem after all: I just had some inconsistencies in my input, which made
Giraph explode. I did a rerun with correct input data and now the whole
thing finishes in a few seconds.

It would of course be nice to have the described out-of-core messaging
with spill-over to disk for bigger problems, but that doesn't seem to be
necessary for the problem space I am in right now :-).
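André doesn't say what the inconsistencies were, so the following is purely
hypothetical: a small pre-flight check over an undirected edge list (the
int[]{src, dst} representation is an assumption, not his actual input
format) that flags self-loops and duplicate edges before the job is
submitted.

```java
import java.util.*;

// Hypothetical pre-flight sanity check for an undirected edge list, run
// before submitting the Giraph job.  The edge representation is assumed.
class EdgeListValidator {

    /** Returns human-readable problems found in the edge list. */
    static List<String> validate(List<int[]> edges) {
        List<String> problems = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        for (int[] e : edges) {
            int src = e[0], dst = e[1];
            if (src == dst)
                problems.add("self-loop at vertex " + src);
            // normalise so (a,b) and (b,a) count as the same undirected edge
            long key = ((long) Math.min(src, dst) << 32)
                     | (Math.max(src, dst) & 0xffffffffL);
            if (!seen.add(key))
                problems.add("duplicate edge " + src + "-" + dst);
        }
        return problems;
    }
}
```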

--André

Re: giraph stability problem

Posted by Avery Ching <ac...@apache.org>.
There is certainly a class of graph problems that just "blow up": the data 
grows until the graph is fully connected, or at least carries information 
that requires n^2 memory.  Unfortunately, this really does require a lot of 
space (memory and disk), and there is not much you can do about it except 
change your algorithm or work on a smaller problem size.  For instance, 
the shortest-paths algorithm (SimpleShortestPathsVertex) could be changed 
to find the actual shortest path to every node, but this would blow up 
memory/space as well.  Disk can still help for some problems, though, 
since we typically have 1-2 more orders of magnitude of disk than memory.
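To make the n^2 point concrete, here is a toy illustration (plain Java, not
Giraph code) using a line graph 0-1-2-...-(n-1): storing just the shortest
*distance* from vertex 0 to every vertex costs n values, while storing the
actual shortest *path* to each vertex costs 1 + 2 + ... + (n-1) = n(n-1)/2
vertex ids, which is quadratic in n.

```java
// Storage cost on a line graph 0-1-2-...-(n-1), measured in entries:
// distances from vertex 0 need n values, full paths need n(n-1)/2 ids.
class PathStorage {
    static long distanceEntries(long n) { return n; }
    static long pathEntries(long n)     { return n * (n - 1) / 2; }
}
```

For n = 1,000,000 that is one million distance entries versus roughly
5 * 10^11 path entries.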

Avery

On 1/23/12 7:16 AM, Claudio Martella wrote:
> There's not much you can do If you're trying an algorithm with
> exponential space complexity and you don't have enough disk space.
> What do you suggest?
>
> On Mon, Jan 23, 2012 at 4:13 PM, Deepak Nettem<de...@gmail.com>  wrote:
>> I have seen people run into this problem when doing Graph Processing
>> directly on top of Hadoop too. This kind of an approach would take
>> exponential space.
>>
>> While the proposed solution would prevent the OOM problem, eventually one
>> would run out of disk space.
>>
>>
>> On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella
>> <cl...@gmail.com>  wrote:
>>> [...]


Re: giraph stability problem

Posted by Claudio Martella <cl...@gmail.com>.
There's not much you can do if you're trying an algorithm with
exponential space complexity and you don't have enough disk space.
What do you suggest?

On Mon, Jan 23, 2012 at 4:13 PM, Deepak Nettem <de...@gmail.com> wrote:
> I have seen people run into this problem when doing Graph Processing
> directly on top of Hadoop too. This kind of an approach would take
> exponential space.
>
> While the proposed solution would prevent the OOM problem, eventually one
> would run out of disk space.
>
>
> On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella
> <cl...@gmail.com> wrote:
>> [...]



-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: giraph stability problem

Posted by Deepak Nettem <de...@gmail.com>.
I have seen people run into this problem when doing graph processing
directly on top of Hadoop too. This kind of approach takes
exponential space.

While the proposed solution would prevent the OOM problem, eventually one
would run out of disk space.

On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella <
claudio.martella@gmail.com> wrote:

> [...]

Re: giraph stability problem

Posted by Claudio Martella <cl...@gmail.com>.
Hi,

I've been struggling with a similar problem, and that's why I've started
working on out-of-core message management for when memory runs low. Your
particular problem has an upper bound of exponential space complexity,
which is what you experience as the OOM: the number of paths you can
extract for each source vertex is about O(d^l), where d is the average
degree of the graph and l is the max length of the extracted path (if you
do shortest paths, it's the diameter).
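As a back-of-envelope illustration of that O(d^l) bound (plain arithmetic;
d and l here are assumed values, not measured from André's graph):

```java
// Rough count of the candidate paths a single source vertex can fan out
// to, given an assumed average degree d and maximum path length l.
class PathBlowup {
    static double paths(double avgDegree, int maxLength) {
        return Math.pow(avgDegree, maxLength);
    }
}
```

With d = 3 and l = 47 (the superstep where the job died, since the walked
path grows by one edge per superstep), this already allows on the order of
10^22 candidate paths per source vertex. Real graphs prune most of them,
but it shows where the memory goes.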

Giraph is all good and fast but it's all in-memory and for this reason
it currently lacks a solution to your problem. I suggest you wait
until GIRAPH-45 is ready (I should write an email tonight about that
patch).

Hope it makes sense to you,
Claudio

On Mon, Jan 23, 2012 at 10:56 AM, André Kelpe
<ef...@googlemail.com> wrote:
> [...]



-- 
   Claudio Martella
   claudio.martella@gmail.com