You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@harmony.apache.org by Salikh Zakirov <Sa...@Intel.com> on 2006/11/14 11:25:29 UTC

[design] Class unloading: false sharing of vtable marks

As we discussed before, the VTable marks approach [1] has a "false sharing" problem
on a multiprocessor:

when one thread is writing to vtable mark, it is invalidating respective cache line
in other processor caches. Meanwhile, since gcmaps, located near vtable marks,
are loaded frequently during heap tracing, the same cache line will be loaded
and invalidated repeatedly, leading to huge load to memory bus and harming performance.

*Illustration*: original "VTable marks" suggestion applied to current DRLVM object layout.

   object            VTable                   gcmap
 +--------+        +-----------+            +------------------+
 | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
 |  ...   |        |   mark    |            | offset of ref #2 |
 +--------+        +    ...    |            |       ...        |
                   +-----------+            |        0         |
                                            +------------------+

I would like suggest solution to false sharing problem using additional
level of indirection, that is, to store the _pointer to mark word_  in VTable
rather than mark word itself.

*Illustration*: "indirect VTable marks" suggestion

   object            VTable                   gcmap
 +--------+        +-----------+            +------------------+
 | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
 |  ...   |        |  mark ptr |---,        | offset of ref #2 |
 +--------+        +    ...    |   |        |       ...        |
                   +-----------+   |        |        0         |
                                   |        +------------------+
                                   v
                              [mark word]

I do not think this will hurt performance significantly in comparison with original
"vtable marks" approach, because, additional load of mark_ptr is very likely
to be served from the first-level cache, because it happens at the same time
as gcmap_ptr load. (If the mark_ptr is loaded first, then subsequent load of gcmap_ptr
will be served from cache, so no additional memory load overhead anyway).

In current DRLVM design [2], each VTable already have pointers to native Class structure:

    Class* clss;

It looks like the same pointer can be reused for VTable mark word, if we allocate
VTable mark word as the first word of struct Class.
In this way, even the size of VTable structure will not be changed comparing
to current size. The resulting object layout diagram would be

*Illustration*: "indirect VTable marks stored in struct Class"

   object            VTable                   gcmap
 +--------+        +-----------+            +------------------+
 | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
 |  ...   |        |  clss ptr |---,        | offset of ref #2 |
 +--------+        +    ...    |   |        |       ...        |
                   +-----------+   |        |        0         |
                                   |        +------------------+
                                   v
                             +-----------+
                             | mark word |
                             |    ...    |
                             +-----------+ 
                              struct Class


Robin suggested "side byte-map" as another solution to the same false sharing problem. 
As I do not completely understand how this side byte-map would be implemented, 
I do not know if it is similar to this suggestion.

Robin, could you comment on it?

[1] http://wiki.apache.org/harmony/ClassUnloading
[2] http://svn.apache.org/viewvc/incubator/harmony/enhanced/drlvm/trunk/vm/vmcore/include/vtable.h?view=co

(*
This is a follow-up to design proposals at 
http://wiki.apache.org/harmony/ClassUnloading

I am starting new discussion because mailing list is a better means for discussion
than Wiki. After we come to conclusion, I will log it to the wiki page.
*)

Re: [drlvm] Re: [design] Class unloading: false sharing of vtable marks

Posted by Weldon Washburn <we...@gmail.com>.

On 11/14/06, Robin Garner <ro...@anu.edu.au> wrote:
>
> Salikh Zakirov wrote:
> > As we discussed before, the VTable marks approach [1] has a "false
> sharing" problem
> > on a multiprocessor:
> >
> > when one thread is writing to vtable mark, it is invalidating respective
> cache line
> > in other processor caches. Meanwhile, since gcmaps, located near vtable
> marks,
> > are loaded frequently during heap tracing, the same cache line will be
> loaded
> > and invalidated repeatedly, leading to huge load to memory bus and
> harming performance.
> >
> > *Illustration*: original "VTable marks" suggestion applied to current
> DRLVM object layout.
> >
> >    object            VTable                   gcmap
> >  +--------+        +-----------+            +------------------+
> >  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
> >  |  ...   |        |   mark    |            | offset of ref #2 |
> >  +--------+        +    ...    |            |       ...        |
> >                    +-----------+            |        0         |
> >                                             +------------------+
> >
> > I would like suggest solution to false sharing problem using additional
> > level of indirection, that is, to store the _pointer to mark word_  in
> VTable
> > rather than mark word itself.
> >
> > *Illustration*: "indirect VTable marks" suggestion
> >
> >    object            VTable                   gcmap
> >  +--------+        +-----------+            +------------------+
> >  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
> >  |  ...   |        |  mark ptr |---,        | offset of ref #2 |
> >  +--------+        +    ...    |   |        |       ...        |
> >                    +-----------+   |        |        0         |
> >                                    |        +------------------+
> >                                    v
> >                               [mark word]
> >
> > I do not think this will hurt performance significantly in comparison
> with original
> > "vtable marks" approach, because, additional load of mark_ptr is very
> likely
> > to be served from the first-level cache, because it happens at the same
> time
> > as gcmap_ptr load. (If the mark_ptr is loaded first, then subsequent
> load of gcmap_ptr
> > will be served from cache, so no additional memory load overhead
> anyway).
> >
> > In current DRLVM design [2], each VTable already have pointers to native
> Class structure:
> >
> >     Class* clss;
> >
> > It looks like the same pointer can be reused for VTable mark word, if we
> allocate
> > VTable mark word as the first word of struct Class.
> > In this way, even the size of VTable structure will not be changed
> comparing
> > to current size. The resulting object layout diagram would be
> >
> > *Illustration*: "indirect VTable marks stored in struct Class"
> >
> >    object            VTable                   gcmap
> >  +--------+        +-----------+            +------------------+
> >  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
> >  |  ...   |        |  clss ptr |---,        | offset of ref #2 |
> >  +--------+        +    ...    |   |        |       ...        |
> >                    +-----------+   |        |        0         |
> >                                    |        +------------------+
> >                                    v
> >                              +-----------+
> >                              | mark word |
> >                              |    ...    |
> >                              +-----------+
> >                               struct Class
>
> Whether this helps performance depends on the cache policy of the
> multiprocessor.  I'm not sufficiently versed in cache architectures to
> say, but I would expect that machines with sufficiently weak memory
> models will make this cheap, those without will be expensive.


Good point.  My guess is that it will be hard to finish this part of the
design without trying out some different implementations on selected HW.
There may be factors in addition to false sharing of vtable marks that do
not show up in debate.  But do show up in experiment.

>
> > Robin suggested "side byte-map" as another solution to the same false
> sharing problem.
> > As I do not completely understand how this side byte-map would be
> implemented,
> > I do not know if it is similar to this suggestion.
> >
> > Robin, could you comment on it?
> >
> > [1] http://wiki.apache.org/harmony/ClassUnloading
> > [2]
> http://svn.apache.org/viewvc/incubator/harmony/enhanced/drlvm/trunk/vm/vmcore/include/vtable.h?view=co
>
> Hi Salikh,
> I nearly missed this post because I usually filter on [drlvm], but I'm
> happy to explain.
>
> Metadata 'on the side' is a standard technique in GC.  Take the simplest
> case, where you want to do mark-sweep, but don't have any space in the
> object to do so.
>
> Assume (initially, for maximum simplicity) that all objects are of the
> same size (n bytes).  Allocate an array of booleans of size
> (heap-bytes/n), marks[].  So the nth object is represented by marks[n].
>
> In reality, objects have variable sizes, so you can relax the conditions
> by allocating an array of (heap/max-align), and marking the boolean
> corresponding to the header of every object.
>
> In MMTk's side mark-bit implementation, we use distributed metadata,
> specifically, when we allocate 4MB of virtual addresses, we allocate
> 2^22/2^6 = 2^16 bits = 8k bytes of 'side bitmap' at the beginning of the
> 4MB chunk.
>
> Calculating the address of the mark bit is a simple mask and shift
> operation on the address of the mark bit (pure register/ALU, v fast,
> overlapped with memory fetches).
>
> hope this helps,
> Robin
>
> > (*
> > This is a follow-up to design proposals at
> > http://wiki.apache.org/harmony/ClassUnloading
> >
> > I am starting new discussion because mailing list is a better means for
> discussion
> > than Wiki. After we come to conclusion, I will log it to the wiki page.
> > *)
> >
> >
>
>
> --
> Robin Garner
> Dept. of Computer Science
> Australian National University
> http://cs.anu.edu.au/people/Robin.Garner/
>



-- 
Weldon Washburn
Intel Enterprise Solutions Software Division

Re: [drlvm] Re: [design] Class unloading: false sharing of vtable marks

Posted by Rana Dasgupta <rd...@gmail.com>.

On 11/14/06, Robin Garner <ro...@anu.edu.au> wrote:
>
>
> >Whether this helps performance depends on the cache policy of the
> >multiprocessor.  I'm not sufficiently versed in cache architectures to
> >say, but I would expect that machines with sufficiently weak memory
> >models will make this cheap, those without will be expensive.

Salikh points out an interesting issue, but I think that this may be subject
to some assumptions about  SMP and hypertheaded architectures, and prevalent
L2 sharing models, at least on certain processors. This is a rapidly
changing area and as Weldon points out, the best approach may be to do the
prototype, and then run analytical tools to identify the pathologies and
tweak accordingly. Upfront, the overhead of a level of indirection, the
benefits of sharing the same cache line as the gcmap ptr and the cost of the
false sharing on some architectures is diffcult to figure out without
prototyping.

[drlvm] Re: [design] Class unloading: false sharing of vtable marks

Posted by Robin Garner <ro...@anu.edu.au>.

Salikh Zakirov wrote:
> As we discussed before, the VTable marks approach [1] has a "false sharing" problem
> on a multiprocessor:
> 
> when one thread is writing to vtable mark, it is invalidating respective cache line
> in other processor caches. Meanwhile, since gcmaps, located near vtable marks,
> are loaded frequently during heap tracing, the same cache line will be loaded
> and invalidated repeatedly, leading to huge load to memory bus and harming performance.
> 
> *Illustration*: original "VTable marks" suggestion applied to current DRLVM object layout.
> 
>    object            VTable                   gcmap
>  +--------+        +-----------+            +------------------+
>  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
>  |  ...   |        |   mark    |            | offset of ref #2 |
>  +--------+        +    ...    |            |       ...        |
>                    +-----------+            |        0         |
>                                             +------------------+
> 
> I would like suggest solution to false sharing problem using additional
> level of indirection, that is, to store the _pointer to mark word_  in VTable
> rather than mark word itself.
> 
> *Illustration*: "indirect VTable marks" suggestion
> 
>    object            VTable                   gcmap
>  +--------+        +-----------+            +------------------+
>  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
>  |  ...   |        |  mark ptr |---,        | offset of ref #2 |
>  +--------+        +    ...    |   |        |       ...        |
>                    +-----------+   |        |        0         |
>                                    |        +------------------+
>                                    v
>                               [mark word]
> 
> I do not think this will hurt performance significantly in comparison with original
> "vtable marks" approach, because, additional load of mark_ptr is very likely
> to be served from the first-level cache, because it happens at the same time
> as gcmap_ptr load. (If the mark_ptr is loaded first, then subsequent load of gcmap_ptr
> will be served from cache, so no additional memory load overhead anyway).
> 
> In current DRLVM design [2], each VTable already have pointers to native Class structure:
> 
>     Class* clss;
> 
> It looks like the same pointer can be reused for VTable mark word, if we allocate
> VTable mark word as the first word of struct Class.
> In this way, even the size of VTable structure will not be changed comparing
> to current size. The resulting object layout diagram would be
> 
> *Illustration*: "indirect VTable marks stored in struct Class"
> 
>    object            VTable                   gcmap
>  +--------+        +-----------+            +------------------+
>  | VT ptr |------->| gcmap ptr |----------->| offset of ref #1 |
>  |  ...   |        |  clss ptr |---,        | offset of ref #2 |
>  +--------+        +    ...    |   |        |       ...        |
>                    +-----------+   |        |        0         |
>                                    |        +------------------+
>                                    v
>                              +-----------+
>                              | mark word |
>                              |    ...    |
>                              +-----------+ 
>                               struct Class

Whether this helps performance depends on the cache policy of the 
multiprocessor.  I'm not sufficiently versed in cache architectures to 
say, but I would expect that machines with sufficiently weak memory 
models will make this cheap, those without will be expensive.

> 
> Robin suggested "side byte-map" as another solution to the same false sharing problem. 
> As I do not completely understand how this side byte-map would be implemented, 
> I do not know if it is similar to this suggestion.
> 
> Robin, could you comment on it?
> 
> [1] http://wiki.apache.org/harmony/ClassUnloading
> [2] http://svn.apache.org/viewvc/incubator/harmony/enhanced/drlvm/trunk/vm/vmcore/include/vtable.h?view=co

Hi Salikh,
I nearly missed this post because I usually filter on [drlvm], but I'm 
happy to explain.

Metadata 'on the side' is a standard technique in GC.  Take the simplest 
case, where you want to do mark-sweep, but don't have any space in the 
object to do so.

Assume (initially, for maximum simplicity) that all objects are of the 
same size (n bytes).  Allocate an array of booleans of size 
(heap-bytes/n), marks[].  So the nth object is represented by marks[n].

In reality, objects have variable sizes, so you can relax the conditions 
by allocating an array of (heap/max-align), and marking the boolean 
corresponding to the header of every object.

In MMTk's side mark-bit implementation, we use distributed metadata, 
specifically, when we allocate 4MB of virtual addresses, we allocate 
2^22/2^6 = 2^16 bits = 8k bytes of 'side bitmap' at the beginning of the 
4MB chunk.

Calculating the address of the mark bit is a simple mask and shift 
operation on the address of the mark bit (pure register/ALU, v fast, 
overlapped with memory fetches).

hope this helps,
Robin

> (*
> This is a follow-up to design proposals at 
> http://wiki.apache.org/harmony/ClassUnloading
> 
> I am starting new discussion because mailing list is a better means for discussion
> than Wiki. After we come to conclusion, I will log it to the wiki page.
> *)
> 
> 


-- 
Robin Garner
Dept. of Computer Science
Australian National University
http://cs.anu.edu.au/people/Robin.Garner/