You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@giraph.apache.org by Claudio Martella <cl...@gmail.com> on 2012/08/17 17:17:46 UTC

Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low

Yes, that is definitely the direction you may want to take at a certain
moment. That is basically what Stanford gps does as well, and stratosphere
too.

On Friday, August 17, 2012, Alessandro Presta wrote:

> I think at that point it would be worth having a new logical place for
> vertex/edge representation at worker- or partition-level.
> Avery had some ideas about this.
>
> Basically right now we're giving the user the freedom (and responsibility)
> to choose a representation (both in-memory and for serialization), but
> another way to go would be to take care of all that at infrastructure
> level and expose only one Vertex class (where the user only defines the
> computation details and everything else is abstracted away). Then we could
> play around with compact representations and even more disruptive
> strategies (like streaming the graph/messages and re-using objects).
>
> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org<javascript:;>
> >
> wrote:
>
> >I was under the impression that 100k was the upper limit to make things
> >work without crashing.
> >
> >In any case, if one wanted to use a compressed memory representation by
> >aggregating different edge lists together, could one use the worker
> >context
> >as a central point of access to the compressed graphs?
> >I can imagine a vertex class that has only the ID and uses the worker
> >context to access its edge list (i.e. it is only a client to a central
> >per-machine repository).
> >Vertexes in the same partition would share this data structure.
> >
> >Is there any obvious technical fallacy in this scheme?
> >
> >Cheers,
> >--
> >Gianmarco
> >
> >
> >
> >On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
> ><al...@fb.com>wrote:
> >
> >> The example where we actually go out of memory was with 500K vertices
> >>and
> >> 500M edges, but yes, as a general rule we should strive to reduce our
> >> memory footprint in order to push the point where we need to go out of
> >> core as far away as possible.
> >>
> >> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gd...@apache.org>
> >> wrote:
> >>
> >> >Very interesting.
> >> >
> >> >On a side note, a graph with 100k vertices and 100M edges is largish
> >>but
> >> >not that big after all.
> >> >If it does not fit on 10+ GB of memory, it means that each edge
> >>occupies
> >> >around 100B (amortizing the cost of the vertex over the edges).
> >> >In my opinion this deserves some thought.
> >> >If memory is an issue, why not think about compressed memory
> >>structures,
> >> >at
> >> >least for common graph formats?
> >> >
> >> >Cheers,
> >> >--
> >> >Gianmarco
> >> >
> >> >
> >> >
> >> >On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
> >> ><in...@gmail.com>wrote:
> >> >
> >> >> Great metrics, this made a very interesting read, and great code too
> >>as
> >> >> always. This must have been a lot of work. I like the idea of
> >> >>eliminating
> >> >> the extra temporary storage data structures where possible, even when
> >> >>not
> >> >> going out-of-core. I think that + avoiding extra object creation
> >>during
> >> >>the
> >> >> workflow can still do a lot for in-core job's memory profile, but
> >>this
> >> >>is
> >> >> looking really good and sounds like with the config options its also
> >> >> pluggable depending on your hardware situation, so it sounds great to
> >> >>me.
> >> >> Great work!
> >> >>
> >> >> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
> >> >> <ji...@apache.org>wrote:
> >> >>
> >> >> >
> >> >> >     [
> >> >> >
> >> >>
> >>
> >>>>
> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
> >>>>a
> >> .
> >>
> >>>>plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437
> >>>>#c
> >> >>omment-13435437
> >> >> ]
> >> >> >
> >> >> > Alessandro Presta commented on GIRAPH-249:
> >> >> > ------------------------------------------
> >> >> >
> >> >> > Thanks Claudio, good observation.
> >> >> > You got me curious so I quickly ran a shortest paths benchmark.
> >> >> >
> >> >> > 500k vertices, 100 edges/vertex, 10 workers
> >> >> >
> >> >> > This is with trunk:
> >> >> >
> >> >> > {code}
> >> >> > hadoop jar giraph-trunk.jar
> >> >> > org.apache.giraph.benchmark.ShortestPathsBenchmark
> >> >>-Dgiraph.useN



-- 
   Claudio Martella
   claudio.martella@gmail.com

Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.

Hi,
interesting discussion.

On Fri, Aug 17, 2012 at 11:03 PM, Alessandro Presta <al...@fb.com>wrote:

> So, should we start brainstorming on possible approaches?
>
> I'll begin by listing some assumptions we may (or may not) consider
> reasonable (most of these have already come up at some point from various
> people):
>
> - Most algorithms need to iterate over all edges/messages, hence we could
> stop offering random access. This should allow for compact/serialized
> representations and even streaming data as we compute.
>

I guess we could have different versions of basic vertexes that we could
provide, each with different features.
For example RandomAccessVertex could provide fast random access to edges at
the expense of more memory footprint, while SequentialAccessVertex could
save space by encoding the edges in some compact form (e.g. gap coding).


> - We need to reduce object instantiations. By keeping data serialized, we
> could re-use objects: think of a single vertex object that reads its
> serialized data and computes on the spot; edge/message iterators can be
> backed by a byte array/stream.
>

That's definitely a must. Object instantiation takes a long time, bloats
the memory and forces the GC to do extra work.
By keeping everything in ByteBuffers and using primitives we could remove
object instantiation completely.
The only problem, alas, is that Java generics don't play well with
primitives.


> - It may be desirable to always store at least the vertex ids in memory. A
> partition can become a container of ids instead of full vertices. We may
> restrict the vertex id type to something like 64bit integers if that makes
> life easier.
>

I guess that would do in most cases.
Though sometimes it might be convenient to use strings as IDs, maybe even
just for testing.

On the other hand, if we have a good random access structure that can
compress also the IDs, why not use it to implement a partition and keep
everything there?

Cheers,
--
Gianmarco


> On 8/17/12 7:20 PM, "Avery Ching" <ac...@apache.org> wrote:
>
> >Agreed, we should (and will be) moving more along these directions (byte
> >arrays / ByteBuffer) in the future for Giraph.
> >
> >On 8/17/12 8:36 AM, Sebastian Schelter wrote:
> >> Stratosphere even employs its own memory management by serializing data
> >> to preallocated byte arrays. This does not only allow for a very compact
> >> representation of the data, but also avoids major GC pauses and allows
> >> different buffer implementations to gracefully spill to disk.
> >>
> >> On 17.08.2012 17:17, Claudio Martella wrote:
> >>> Yes, that is definitely the direction you may want to take at a certain
> >>> moment. That is basically what Stanford gps does as well, and
> >>>stratosphere
> >>> too.
> >>>
> >>> On Friday, August 17, 2012, Alessandro Presta wrote:
> >>>
> >>>> I think at that point it would be worth having a new logical place for
> >>>> vertex/edge representation at worker- or partition-level.
> >>>> Avery had some ideas about this.
> >>>>
> >>>> Basically right now we're giving the user the freedom (and
> >>>>responsibility)
> >>>> to choose a representation (both in-memory and for serialization), but
> >>>> another way to go would be to take care of all that at infrastructure
> >>>> level and expose only one Vertex class (where the user only defines
> >>>>the
> >>>> computation details and everything else is abstracted away). Then we
> >>>>could
> >>>> play around with compact representations and even more disruptive
> >>>> strategies (like streaming the graph/messages and re-using objects).
> >>>>
> >>>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales"
> >>>><gdfm@apache.org<javascript:;>
> >>>> wrote:
> >>>>
> >>>>> I was under the impression that 100k was the upper limit to make
> >>>>>things
> >>>>> work without crashing.
> >>>>>
> >>>>> In any case, if one wanted to use a compressed memory representation
> >>>>>by
> >>>>> aggregating different edge lists together, could one use the worker
> >>>>> context
> >>>>> as a central point of access to the compressed graphs?
> >>>>> I can imagine a vertex class that has only the ID and uses the worker
> >>>>> context to access its edge list (i.e. it is only a client to a
> >>>>>central
> >>>>> per-machine repository).
> >>>>> Vertexes in the same partition would share this data structure.
> >>>>>
> >>>>> Is there any obvious technical fallacy in this scheme?
> >>>>>
> >>>>> Cheers,
> >>>>> --
> >>>>> Gianmarco
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
> >>>>> <al...@fb.com>wrote:
> >>>>>
> >>>>>> The example where we actually go out of memory was with 500K
> >>>>>>vertices
> >>>>>> and
> >>>>>> 500M edges, but yes, as a general rule we should strive to reduce
> >>>>>>our
> >>>>>> memory footprint in order to push the point where we need to go out
> >>>>>>of
> >>>>>> core as far away as possible.
> >>>>>>
> >>>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales"
> >>>>>><gd...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Very interesting.
> >>>>>>>
> >>>>>>> On a side note, a graph with 100k vertices and 100M edges is
> >>>>>>>largish
> >>>>>> but
> >>>>>>> not that big after all.
> >>>>>>> If it does not fit on 10+ GB of memory, it means that each edge
> >>>>>> occupies
> >>>>>>> around 100B (amortizing the cost of the vertex over the edges).
> >>>>>>> In my opinion this deserves some thought.
> >>>>>>> If memory is an issue, why not think about compressed memory
> >>>>>> structures,
> >>>>>>> at
> >>>>>>> least for common graph formats?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> --
> >>>>>>> Gianmarco
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
> >>>>>>> <in...@gmail.com>wrote:
> >>>>>>>
> >>>>>>>> Great metrics, this made a very interesting read, and great code
> >>>>>>>>too
> >>>>>> as
> >>>>>>>> always. This must have been a lot of work. I like the idea of
> >>>>>>>> eliminating
> >>>>>>>> the extra temporary storage data structures where possible, even
> >>>>>>>>when
> >>>>>>>> not
> >>>>>>>> going out-of-core. I think that + avoiding extra object creation
> >>>>>> during
> >>>>>>>> the
> >>>>>>>> workflow can still do a lot for in-core job's memory profile, but
> >>>>>> this
> >>>>>>>> is
> >>>>>>>> looking really good and sounds like with the config options its
> >>>>>>>>also
> >>>>>>>> pluggable depending on your hardware situation, so it sounds
> >>>>>>>>great to
> >>>>>>>> me.
> >>>>>>>> Great work!
> >>>>>>>>
> >>>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
> >>>>>>>> <ji...@apache.org>wrote:
> >>>>>>>>
> >>>>>>>>>      [
> >>>>>>>>>
> >>>>
> >>>>
> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
> >>>>>>>> a
> >>>>>> .
> >>>>>>
> >>>>>>>>
> >>>>>>>>plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1343
> >>>>>>>>5437
> >>>>>>>> #c
> >>>>>>>> omment-13435437
> >>>>>>>> ]
> >>>>>>>>> Alessandro Presta commented on GIRAPH-249:
> >>>>>>>>> ------------------------------------------
> >>>>>>>>>
> >>>>>>>>> Thanks Claudio, good observation.
> >>>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
> >>>>>>>>>
> >>>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
> >>>>>>>>>
> >>>>>>>>> This is with trunk:
> >>>>>>>>>
> >>>>>>>>> {code}
> >>>>>>>>> hadoop jar giraph-trunk.jar
> >>>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
> >>>>>>>> -Dgiraph.useN
> >>>
> >>>
> >
>
>

Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low

Posted by Alessandro Presta <al...@fb.com>.

So, should we start brainstorming on possible approaches?

I'll begin by listing some assumptions we may (or may not) consider
reasonable (most of these have already come up at some point from various
people):

- Most algorithms need to iterate over all edges/messages, hence we could
stop offering random access. This should allow for compact/serialized
representations and even streaming data as we compute.
- We need to reduce object instantiations. By keeping data serialized, we
could re-use objects: think of a single vertex object that reads its
serialized data and computes on the spot; edge/message iterators can be
backed by a byte array/stream.
- It may be desirable to always store at least the vertex ids in memory. A
partition can become a container of ids instead of full vertices. We may
restrict the vertex id type to something like 64bit integers if that makes
life easier.

On 8/17/12 7:20 PM, "Avery Ching" <ac...@apache.org> wrote:

>Agreed, we should (and will be) moving more along these directions (byte
>arrays / ByteBuffer) in the future for Giraph.
>
>On 8/17/12 8:36 AM, Sebastian Schelter wrote:
>> Stratosphere even employs its own memory management by serializing data
>> to preallocated byte arrays. This does not only allow for a very compact
>> representation of the data, but also avoids major GC pauses and allows
>> different buffer implementations to gracefully spill to disk.
>>
>> On 17.08.2012 17:17, Claudio Martella wrote:
>>> Yes, that is definitely the direction you may want to take at a certain
>>> moment. That is basically what Stanford gps does as well, and
>>>stratosphere
>>> too.
>>>
>>> On Friday, August 17, 2012, Alessandro Presta wrote:
>>>
>>>> I think at that point it would be worth having a new logical place for
>>>> vertex/edge representation at worker- or partition-level.
>>>> Avery had some ideas about this.
>>>>
>>>> Basically right now we're giving the user the freedom (and
>>>>responsibility)
>>>> to choose a representation (both in-memory and for serialization), but
>>>> another way to go would be to take care of all that at infrastructure
>>>> level and expose only one Vertex class (where the user only defines
>>>>the
>>>> computation details and everything else is abstracted away). Then we
>>>>could
>>>> play around with compact representations and even more disruptive
>>>> strategies (like streaming the graph/messages and re-using objects).
>>>>
>>>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales"
>>>><gdfm@apache.org<javascript:;>
>>>> wrote:
>>>>
>>>>> I was under the impression that 100k was the upper limit to make
>>>>>things
>>>>> work without crashing.
>>>>>
>>>>> In any case, if one wanted to use a compressed memory representation
>>>>>by
>>>>> aggregating different edge lists together, could one use the worker
>>>>> context
>>>>> as a central point of access to the compressed graphs?
>>>>> I can imagine a vertex class that has only the ID and uses the worker
>>>>> context to access its edge list (i.e. it is only a client to a
>>>>>central
>>>>> per-machine repository).
>>>>> Vertexes in the same partition would share this data structure.
>>>>>
>>>>> Is there any obvious technical fallacy in this scheme?
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Gianmarco
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>>>> <al...@fb.com>wrote:
>>>>>
>>>>>> The example where we actually go out of memory was with 500K
>>>>>>vertices
>>>>>> and
>>>>>> 500M edges, but yes, as a general rule we should strive to reduce
>>>>>>our
>>>>>> memory footprint in order to push the point where we need to go out
>>>>>>of
>>>>>> core as far away as possible.
>>>>>>
>>>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales"
>>>>>><gd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Very interesting.
>>>>>>>
>>>>>>> On a side note, a graph with 100k vertices and 100M edges is
>>>>>>>largish
>>>>>> but
>>>>>>> not that big after all.
>>>>>>> If it does not fit on 10+ GB of memory, it means that each edge
>>>>>> occupies
>>>>>>> around 100B (amortizing the cost of the vertex over the edges).
>>>>>>> In my opinion this deserves some thought.
>>>>>>> If memory is an issue, why not think about compressed memory
>>>>>> structures,
>>>>>>> at
>>>>>>> least for common graph formats?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> --
>>>>>>> Gianmarco
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>>>> <in...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Great metrics, this made a very interesting read, and great code
>>>>>>>>too
>>>>>> as
>>>>>>>> always. This must have been a lot of work. I like the idea of
>>>>>>>> eliminating
>>>>>>>> the extra temporary storage data structures where possible, even
>>>>>>>>when
>>>>>>>> not
>>>>>>>> going out-of-core. I think that + avoiding extra object creation
>>>>>> during
>>>>>>>> the
>>>>>>>> workflow can still do a lot for in-core job's memory profile, but
>>>>>> this
>>>>>>>> is
>>>>>>>> looking really good and sounds like with the config options its
>>>>>>>>also
>>>>>>>> pluggable depending on your hardware situation, so it sounds
>>>>>>>>great to
>>>>>>>> me.
>>>>>>>> Great work!
>>>>>>>>
>>>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>>>> <ji...@apache.org>wrote:
>>>>>>>>
>>>>>>>>>      [
>>>>>>>>>
>>>> 
>>>>https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
>>>>>>>> a
>>>>>> .
>>>>>>
>>>>>>>> 
>>>>>>>>plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1343
>>>>>>>>5437
>>>>>>>> #c
>>>>>>>> omment-13435437
>>>>>>>> ]
>>>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>>>> ------------------------------------------
>>>>>>>>>
>>>>>>>>> Thanks Claudio, good observation.
>>>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>>>
>>>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>>>
>>>>>>>>> This is with trunk:
>>>>>>>>>
>>>>>>>>> {code}
>>>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>>>> -Dgiraph.useN
>>>
>>>
>

Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low

Posted by Avery Ching <ac...@apache.org>.

Agreed, we should (and will be) moving more along these directions (byte 
arrays / ByteBuffer) in the future for Giraph.

On 8/17/12 8:36 AM, Sebastian Schelter wrote:
> Stratosphere even employs its own memory management by serializing data
> to preallocated byte arrays. This does not only allow for a very compact
> representation of the data, but also avoids major GC pauses and allows
> different buffer implementations to gracefully spill to disk.
>
> On 17.08.2012 17:17, Claudio Martella wrote:
>> Yes, that is definitely the direction you may want to take at a certain
>> moment. That is basically what Stanford gps does as well, and stratosphere
>> too.
>>
>> On Friday, August 17, 2012, Alessandro Presta wrote:
>>
>>> I think at that point it would be worth having a new logical place for
>>> vertex/edge representation at worker- or partition-level.
>>> Avery had some ideas about this.
>>>
>>> Basically right now we're giving the user the freedom (and responsibility)
>>> to choose a representation (both in-memory and for serialization), but
>>> another way to go would be to take care of all that at infrastructure
>>> level and expose only one Vertex class (where the user only defines the
>>> computation details and everything else is abstracted away). Then we could
>>> play around with compact representations and even more disruptive
>>> strategies (like streaming the graph/messages and re-using objects).
>>>
>>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org<javascript:;>
>>> wrote:
>>>
>>>> I was under the impression that 100k was the upper limit to make things
>>>> work without crashing.
>>>>
>>>> In any case, if one wanted to use a compressed memory representation by
>>>> aggregating different edge lists together, could one use the worker
>>>> context
>>>> as a central point of access to the compressed graphs?
>>>> I can imagine a vertex class that has only the ID and uses the worker
>>>> context to access its edge list (i.e. it is only a client to a central
>>>> per-machine repository).
>>>> Vertexes in the same partition would share this data structure.
>>>>
>>>> Is there any obvious technical fallacy in this scheme?
>>>>
>>>> Cheers,
>>>> --
>>>> Gianmarco
>>>>
>>>>
>>>>
>>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>>> <al...@fb.com>wrote:
>>>>
>>>>> The example where we actually go out of memory was with 500K vertices
>>>>> and
>>>>> 500M edges, but yes, as a general rule we should strive to reduce our
>>>>> memory footprint in order to push the point where we need to go out of
>>>>> core as far away as possible.
>>>>>
>>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gd...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Very interesting.
>>>>>>
>>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
>>>>> but
>>>>>> not that big after all.
>>>>>> If it does not fit on 10+ GB of memory, it means that each edge
>>>>> occupies
>>>>>> around 100B (amortizing the cost of the vertex over the edges).
>>>>>> In my opinion this deserves some thought.
>>>>>> If memory is an issue, why not think about compressed memory
>>>>> structures,
>>>>>> at
>>>>>> least for common graph formats?
>>>>>>
>>>>>> Cheers,
>>>>>> --
>>>>>> Gianmarco
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>>> <in...@gmail.com>wrote:
>>>>>>
>>>>>>> Great metrics, this made a very interesting read, and great code too
>>>>> as
>>>>>>> always. This must have been a lot of work. I like the idea of
>>>>>>> eliminating
>>>>>>> the extra temporary storage data structures where possible, even when
>>>>>>> not
>>>>>>> going out-of-core. I think that + avoiding extra object creation
>>>>> during
>>>>>>> the
>>>>>>> workflow can still do a lot for in-core job's memory profile, but
>>>>> this
>>>>>>> is
>>>>>>> looking really good and sounds like with the config options its also
>>>>>>> pluggable depending on your hardware situation, so it sounds great to
>>>>>>> me.
>>>>>>> Great work!
>>>>>>>
>>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>>> <ji...@apache.org>wrote:
>>>>>>>
>>>>>>>>      [
>>>>>>>>
>>> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
>>>>>>> a
>>>>> .
>>>>>
>>>>>>> plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437
>>>>>>> #c
>>>>>>> omment-13435437
>>>>>>> ]
>>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>>> ------------------------------------------
>>>>>>>>
>>>>>>>> Thanks Claudio, good observation.
>>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>>
>>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>>
>>>>>>>> This is with trunk:
>>>>>>>>
>>>>>>>> {code}
>>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>>> -Dgiraph.useN
>>
>>

Re: [jira] [Created] (GIRAPH-249) Move part of the graph out-of-core when memory is low

Posted by Sebastian Schelter <ss...@apache.org>.

Stratosphere even employs its own memory management by serializing data
to preallocated byte arrays. This does not only allow for a very compact
representation of the data, but also avoids major GC pauses and allows
different buffer implementations to gracefully spill to disk.

On 17.08.2012 17:17, Claudio Martella wrote:
> Yes, that is definitely the direction you may want to take at a certain
> moment. That is basically what Stanford gps does as well, and stratosphere
> too.
> 
> On Friday, August 17, 2012, Alessandro Presta wrote:
> 
>> I think at that point it would be worth having a new logical place for
>> vertex/edge representation at worker- or partition-level.
>> Avery had some ideas about this.
>>
>> Basically right now we're giving the user the freedom (and responsibility)
>> to choose a representation (both in-memory and for serialization), but
>> another way to go would be to take care of all that at infrastructure
>> level and expose only one Vertex class (where the user only defines the
>> computation details and everything else is abstracted away). Then we could
>> play around with compact representations and even more disruptive
>> strategies (like streaming the graph/messages and re-using objects).
>>
>> On 8/17/12 2:30 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org<javascript:;>
>>>
>> wrote:
>>
>>> I was under the impression that 100k was the upper limit to make things
>>> work without crashing.
>>>
>>> In any case, if one wanted to use a compressed memory representation by
>>> aggregating different edge lists together, could one use the worker
>>> context
>>> as a central point of access to the compressed graphs?
>>> I can imagine a vertex class that has only the ID and uses the worker
>>> context to access its edge list (i.e. it is only a client to a central
>>> per-machine repository).
>>> Vertexes in the same partition would share this data structure.
>>>
>>> Is there any obvious technical fallacy in this scheme?
>>>
>>> Cheers,
>>> --
>>> Gianmarco
>>>
>>>
>>>
>>> On Fri, Aug 17, 2012 at 3:18 PM, Alessandro Presta
>>> <al...@fb.com>wrote:
>>>
>>>> The example where we actually go out of memory was with 500K vertices
>>>> and
>>>> 500M edges, but yes, as a general rule we should strive to reduce our
>>>> memory footprint in order to push the point where we need to go out of
>>>> core as far away as possible.
>>>>
>>>> On 8/17/12 2:11 PM, "Gianmarco De Francisci Morales" <gd...@apache.org>
>>>> wrote:
>>>>
>>>>> Very interesting.
>>>>>
>>>>> On a side note, a graph with 100k vertices and 100M edges is largish
>>>> but
>>>>> not that big after all.
>>>>> If it does not fit on 10+ GB of memory, it means that each edge
>>>> occupies
>>>>> around 100B (amortizing the cost of the vertex over the edges).
>>>>> In my opinion this deserves some thought.
>>>>> If memory is an issue, why not think about compressed memory
>>>> structures,
>>>>> at
>>>>> least for common graph formats?
>>>>>
>>>>> Cheers,
>>>>> --
>>>>> Gianmarco
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 15, 2012 at 11:20 PM, Eli Reisman
>>>>> <in...@gmail.com>wrote:
>>>>>
>>>>>> Great metrics, this made a very interesting read, and great code too
>>>> as
>>>>>> always. This must have been a lot of work. I like the idea of
>>>>>> eliminating
>>>>>> the extra temporary storage data structures where possible, even when
>>>>>> not
>>>>>> going out-of-core. I think that + avoiding extra object creation
>>>> during
>>>>>> the
>>>>>> workflow can still do a lot for in-core job's memory profile, but
>>>> this
>>>>>> is
>>>>>> looking really good and sounds like with the config options its also
>>>>>> pluggable depending on your hardware situation, so it sounds great to
>>>>>> me.
>>>>>> Great work!
>>>>>>
>>>>>> On Wed, Aug 15, 2012 at 12:23 PM, Alessandro Presta (JIRA)
>>>>>> <ji...@apache.org>wrote:
>>>>>>
>>>>>>>
>>>>>>>     [
>>>>>>>
>>>>>>
>>>>
>>>>>>
>> https://issues.apache.org/jira/browse/GIRAPH-249?page=com.atlassian.jir
>>>>>> a
>>>> .
>>>>
>>>>>> plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435437
>>>>>> #c
>>>>>> omment-13435437
>>>>>> ]
>>>>>>>
>>>>>>> Alessandro Presta commented on GIRAPH-249:
>>>>>>> ------------------------------------------
>>>>>>>
>>>>>>> Thanks Claudio, good observation.
>>>>>>> You got me curious so I quickly ran a shortest paths benchmark.
>>>>>>>
>>>>>>> 500k vertices, 100 edges/vertex, 10 workers
>>>>>>>
>>>>>>> This is with trunk:
>>>>>>>
>>>>>>> {code}
>>>>>>> hadoop jar giraph-trunk.jar
>>>>>>> org.apache.giraph.benchmark.ShortestPathsBenchmark
>>>>>> -Dgiraph.useN
> 
> 
>