You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@harmony.apache.org by zouqiong <us...@gmail.com> on 2006/08/17 08:41:31 UTC

Re: performance of drlvm

Hi, Mikhail:
    These days i have read a lot of paper about dynamic optimization. I got
the following ideas
to profile the drlvm with PMU:
1. Prefetch Injection Based on Hardware Monitoring and Object Metadata: this
paper is a good
example about how to PMU to dynamic prefetch data with high cache miss rate.
I want to
implement this idea in DRLVM.
2. The idea from Chilimbi. He got the trace by instrumented the code, which
causes high overhead.  I want to improve his implementation. Firstly use the
hardware performance counter
to get the data reference trace, and  abstracts the trace. Secondly use
SEQUITUR algorithm to simplify the trace and represent the trace with DAG
from which we can easily get the hot data stream (follow Chilimbi`s way).
This can instruct how to prefetch the data with high miss rate.

    As for the performance monitor tool, i will think about it later.
Please point out the defect or mistake. Thanks.

Re: performance of drlvm

Posted by Egor Pasko <eg...@gmail.com>.

On the 0x1CF day of Apache Harmony ustczz@gmail.com wrote:
> Hi Egor,
>        Thanks very much for your reply. But I can`t cache up with your
> mean. I must spend a period of time to think about it. And I wil discuss
> with you later.

OK, see you soon :)

P.S.: I am not an expert in sampling profilers, the idea can be simply wrong :)

-- 
Egor Pasko, Intel Managed Runtime Division


---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: harmony-dev-unsubscribe@incubator.apache.org
For additional commands, e-mail: harmony-dev-help@incubator.apache.org

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

Hi Egor,
       Thanks very much for your reply. But I can`t cache up with your
mean. I must spend a period of time to think about it. And I wil discuss
with you later.


-- 
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by Egor Pasko <eg...@gmail.com>.

On the 0x1CF day of Apache Harmony ustczz@gmail.com wrote:
> "An obvious candidate for abstracting data addresses is to use the
> name or type of the object that generates the address. While this
> apporach works for static objects, heap addresses present an
> additional challenge. Heap space is reused as heap data is
> dynamically allocated and freed. Consequently, the same heap address
> can correspond to different program objects during execution....."
> 
> It`s the way Chilimbi used in C or C++ programs. Maybe in jvm, we
> can get the name or type of the object that generate the address
> more easily??? I need your clue.

Hm, there can be a lot of objects with the same name. And many of them
freed at runtime which makes it not so easy to track objects by names..

I have an idea that seems simple but effective: wipe out trace
counters for addresses that are freed during GC. Yes, this needs a GC
support. Hopefully, it would make us a good sampling for objects in
older generations. Stack access can be filtered out without much pain.

Sorry for my interrupting :)

-- 
Egor Pasko, Intel Managed Runtime Division


---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: harmony-dev-unsubscribe@incubator.apache.org
For additional commands, e-mail: harmony-dev-help@incubator.apache.org

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

Hi Mikhai,
As for your listed 3, we need to abstract the Data reference trace. And we
won`t recode the data reference trace in all iterations of the loop, we just
recode
for 100, or maybe less. So the trace won`t be too large. And Chilimbi prove
only
profiling 50 or less iterations is suffice. The problem we must think
seriously is
how to abstract the Data reference trace.

Chilimbi introduced his way to abstract the data reference trace [Efficient
Representations and Abstractions for Quantifying and Exploiting Data
Reference
Locality], I will list it.

"An obvious candidate for abstracting data addresses is to use the name or
type
of the object that generates the address. While this apporach works for
static
objects, heap addresses present an additional challenge. Heap space is
reused
as heap data is dynamically allocated and freed. Consequently, the same heap
address can correspond to different program objects during execution....."

It`s the way Chilimbi used in C or C++ programs. Maybe in jvm, we can get
the
name or type of the object that generate the address more easily??? I need
your
clue.




-- 
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by Weldon Washburn <we...@gmail.com>.

On 8/24/06, zouqiong <us...@gmail.com> wrote:
> En, I list some paper I have read.
>
> 1.Efficient Representations and Abstractions for Quantifying and Exploiting
> Data Reference
>   Locality.
> 2.Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
> 3.Dynamic Hot Data Stream Prefetching for General-Purpose Programs
> 4.Profile-guided Proactive Garbage Collection for Locality Optimization
>
> list 1-4 provide the framework to trace, and the algorithm to process the
> trace.
>
> If you want to learn more about SEQUITUR algorithm, you can read
> 5.Whole Program Paths
>
> All these paper can be found by using Goole. From these paper, you can
> find more
> that you need to read.
>
> And the problems I faced now are:
> 1. Which library to choose to profile the Data Reference Trace. PAPI? We
> need to
> know the pc of the instructions which incur most of the cache miss. I am
> afraid of
> the profiling overhead.
>
> 2.How to abstract the Data Reference Trace for JAVA? Egor had given some
> clue
> on it.
>
> And now I want to know the schedule for this project.

The project schedule will appear on this mailing list.  Please be
patient as it is an open source project.  It relies on volunteers to
make the schedule real.

> I expect to join it.

Excellent!  We need folks involved in Harmony who aggressively read
the GC literature.

> But I am not
> familiar with GC and interface for profile in drlvm, can you give me some
> clue or
> suggestion to learn them as soon as possible. Thanks.

There are several things happening in drlvm gc that you should be
aware of.  There is an effort to design and build a conventional
generational mark/sweep compact collector.  You will see this appear
as "GCV5" on harmony-dev.  There is also an effort to port a well
known research GC called MMTk.  I have been doing the MMTk port and
will be reporting status to harmony-dev in the next day or two.
Although GCV4 is what's available right now, you may find either GCV5
or MMTk more suitable for you work.

Its been a while since I looked at Chilimbi's work.  It might be
worthwhile doing a google on "Chilimbi MMTk" to see if there is any
intersection.

-- 
Weldon Washburn
Intel Middleware Products Division

---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: harmony-dev-unsubscribe@incubator.apache.org
For additional commands, e-mail: harmony-dev-help@incubator.apache.org

Re: performance of drlvm

Posted by Ivan Volosyuk <iv...@gmail.com>.

On 8/25/06, Mikhail Fursov <mi...@gmail.com> wrote:
> On 8/25/06, zouqiong <us...@gmail.com> wrote:
...
> This is the key problem. I hope I will have an opinion on it when I read all
> of the articles you mentioned.
> The only idea I have now is that using an address of an object is not a
> problem. Once GC moves objects we can ask it to patch our references too.
>
> And now I want to know the schedule for this project.  I expect to join it.
> > But I am not
> > familiar with GC and interface for profile in drlvm, can you give me some
> > clue or
> > suggestion to learn them as soon as possible. Thanks.

What is your question about the GC?
If you can use enumeratable GC references, they will be automatically
repointed by garbage collector. In your case it may be even better to
use weak roots and weak references, if you don't want to hold object
when having the last reference to it.
--
Ivan


> I'm also not a guru in GC, but I think there are a lot of people here who
> know the answers and can help in the area they are interested.
> So the proposal is:
> 1) You make the schedule and lead this project. I can help you with dynamic
> recompilation scenarios implementations and with JIT code generation and
> optimizations.
> 2) Why do not to implement classic Chilimbi algorithm first and enhance it
> with PMU data when it's tested?

---------------------------------------------------------------------
Terms of use : http://incubator.apache.org/harmony/mailing.html
To unsubscribe, e-mail: harmony-dev-unsubscribe@incubator.apache.org
For additional commands, e-mail: harmony-dev-help@incubator.apache.org

Re: performance of drlvm

Posted by Mikhail Fursov <mi...@gmail.com>.

On 8/25/06, zouqiong <us...@gmail.com> wrote:
>
> En, I list some paper I have read.
>
> 1.Efficient Representations and Abstractions for Quantifying and
> Exploiting
> Data Reference
>    Locality.
> 2.Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
> 3.Dynamic Hot Data Stream Prefetching for General-Purpose Programs
> 4.Profile-guided Proactive Garbage Collection for Locality Optimization
>
> list 1-4 provide the framework to trace, and the algorithm to process the
> trace.

Ok, I will read all of them on this weekend and hope I will be  more
educated in algorithm you decided to use. Now, I read only 3rd paper and see
that this is not enough to  understand all of the details.

And the problems I faced now are:
> 1. Which library to choose to profile the Data Reference Trace. PAPI? We
> need to
> know the pc of the instructions which incur most of the cache miss. I am
> afraid of
> the profiling overhead.

I would suggest to try this library: http://perfmon2.sourceforge.net/
(paper: http://perfmon2.sourceforge.net/ols2006-perfmon2.pdf)
PAPI was a good library several years ago but today it's development is
frozen and AFAIK P4 support is incomplete.

Why do you worry about profiling overhead if you can counfigure PMU to store
in buffer only one event per million?

2.How to abstract the Data Reference Trace for JAVA? Egor had given some
> clue
> on it.

This is the key problem. I hope I will have an opinion on it when I read all
of the articles you mentioned.
The only idea I have now is that using an address of an object is not a
problem. Once GC moves objects we can ask it to patch our references too.

And now I want to know the schedule for this project.  I expect to join it.
> But I am not
> familiar with GC and interface for profile in drlvm, can you give me some
> clue or
> suggestion to learn them as soon as possible. Thanks.

I'm also not a guru in GC, but I think there are a lot of people here who
know the answers and can help in the area they are interested.
So the proposal is:
1) You make the schedule and lead this project. I can help you with dynamic
recompilation scenarios implementations and with JIT code generation and
optimizations.
2) Why do not to implement classic Chilimbi algorithm first and enhance it
with PMU data when it's tested?

-- 
Mikhail Fursov

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

En, I list some paper I have read.

1.Efficient Representations and Abstractions for Quantifying and Exploiting
Data Reference
   Locality.
2.Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
3.Dynamic Hot Data Stream Prefetching for General-Purpose Programs
4.Profile-guided Proactive Garbage Collection for Locality Optimization

list 1-4 provide the framework to trace, and the algorithm to process the
trace.

If you want to learn more about SEQUITUR algorithm, you can read
5.Whole Program Paths

All these paper can be found by using Goole. From these paper, you can
find more
that you need to read.

And the problems I faced now are:
1. Which library to choose to profile the Data Reference Trace. PAPI? We
need to
know the pc of the instructions which incur most of the cache miss. I am
afraid of
the profiling overhead.

2.How to abstract the Data Reference Trace for JAVA? Egor had given some
clue
on it.

And now I want to know the schedule for this project.  I expect to join it.
But I am not
familiar with GC and interface for profile in drlvm, can you give me some
clue or
suggestion to learn them as soon as possible. Thanks.

-- 
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by Mikhail Fursov <mi...@gmail.com>.

Qiong,
I think I need to read the whole bunch of articles of Chilimbi this
weekend before
to proceed with implementation.
What is the list of articles you can propose to read to anyone who want to
join this project?

On 8/25/06, zouqiong <us...@gmail.com> wrote:
>
> 2006/8/24, zouqiong <us...@gmail.com>:
> >
> > Hi, Mikhail,
> >
> > 1. As for the list 2, Chilimbi acutally use GC moving objects to improve
> > the cache
> > localty [Profile-guided Proactive Garbage Collection for Locality
> > Optimization], but without
> > his algorithm. I will read the paper again. His algorithm mainly focus
> on
> > applications in C
> > or C++. Maybe we can make use of GC to imporve the effect of the
> > algorithm.
> >
> > I made a mistake yesterday, Chilimbi actually investigate his algorithm
> behavior when GC move
> objects in memory. The paper "Profile-guided Proactive Garbage Collection
> for Locality Optimization"
> did so.
> --
> Best Regards,
> Qiong,Zou
>
>


-- 
Mikhail Fursov

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

2006/8/24, zouqiong <us...@gmail.com>:
>
> Hi, Mikhail,
>
> 1. As for the list 2, Chilimbi acutally use GC moving objects to improve
> the cache
> localty [Profile-guided Proactive Garbage Collection for Locality
> Optimization], but without
> his algorithm. I will read the paper again. His algorithm mainly focus on
> applications in C
> or C++. Maybe we can make use of GC to imporve the effect of the
> algorithm.
>
> I made a mistake yesterday, Chilimbi actually investigate his algorithm
behavior when GC move
objects in memory. The paper "Profile-guided Proactive Garbage Collection
for Locality Optimization"
did so.
-- 
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

Hi, Mikhail,

1. As for the list 2, Chilimbi acutally use GC moving objects to improve the
cache
localty [Profile-guided Proactive Garbage Collection for Locality
Optimization], but without
his algorithm. I will read the paper again. His algorithm mainly focus on
applications in C
or C++. Maybe we can make use of GC to imporve the effect of the algorithm.

2.
I read the Chilimbi`s paper again, and figure out a framework. And I will
discuss about
it here.
Firstly, I want to follow Chilimbi`s Bursty Tracing Framework [Bursty
Tracing: A Framework
for Low-Overhead Temporal Profiling ]. It has two versions of the same code,
one for
check, and another for instrument. And in the instrumented code, it will
record all the
memory access patterns, named Data reference trace. What I want to do is to
modify
his instrumented code, once nCheck equals 0, we will active the Performance
Counter
to profile the cache miss rate and cache miss address of the instrumented
code until
the nInstr equals 0. We set up a threshold, and the instructions with miss
rate larger than
it are delinquent load or delinquent store. As we only active the
Performance Counter in
a short period, it won`t bring much overhead.

Secondly, we abstract the delinquent instructions, output WPS, and then Use
the SEQUITUR
algorithm to process them. The comlexity of the algorithm is O(EL). In
Chilimbi`s paper [
Efficient Representations and Abstractions for Quantifying and Exploiting
Data Reference
Locality.], he reduces the trace by processing the WPS twice. And I think we
can follow
his way.

The mainly difference from Chilimbi`s algorithm is listed as above. There
must be some
deficiency. We can discuss about it.


--------------
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by Mikhail Fursov <mi...@gmail.com>.

Hi Qiong,
It's good that we've agreed on hardware issues, now we can discuss the
method by itself in details.
I read the Chilimbi's paper and think it can really work. But there are some
issues in article that are unclear to me and I want to discuss here.

1) Chilimbi's algorithm does not use  hardware counters at all. You said
that you want to "use the hardware performance counter
to get the data reference trace". I agree that PMU counters could help to
optimize the algorithm (e.g. we can do trace profiling only for hot methods
with high cache-miss rate) but I do not understand how do you want to
collect a trace with PMU counters? The usage of PMU counters in programs I
know is time or event based sampling. Sampling means that you get only 1
event of thousands: how to get the trace in this case? But anyway PMU could
be useful to avoid profiling in a methods with low cache-miss rate.

2) Chilimbi does not investigate his algorithm behaviour when GC moves
objects in memory. But once profiling is active the whole time of the
program execution I think that trace cache will be tuned very fast to the
new object locations.

3) I do not understand how Chilimbi's algorithm deals memory access in loops
(example is an iteration of LinkedList). SEQUITUR examples are very smart
when you have an alphabet with a limited number of letters. When you access
the 10000 or more memory addresses from inside of a loop how the trace will
look like?

-- 
Mikhail Fursov

Re: performance of drlvm

Posted by zouqiong <us...@gmail.com>.

Hi, Mikhail,
    I am happy to receive your letter.
    Nowadays I looked though many performance monitor tools, such as
oprofile, perfctr, papi, and at last i want to use perfctr or papi. Papi is
more user-friendly than perfctr, and have been used by many tools.
    This is what i thought about the limitations that you listed:

1. En, I have considered about this problem. We have to install a kernel
patch first. Luckily, both perfctr and papi provide kernel-patches.

2. I have no  problem  to  work on a P4, and the only machine i have is a P4
because i am a student, so I
have no money to buy other kind of machine.

3. Unluckily, papi or perfctr now only works on linux platform. But it`s ok
for me. Although working on all platform
is important, it can be thought about later. Working on one platform is the
first thing, and we should guarantee
its efficiency, control its overhead....

The limitation is ok for me. I think a framework that works efficiently is
important for me. And we can use  better programming skills for future
porting to other platforms.....


-- 
Best Regards,
Qiong,Zou

Re: performance of drlvm

Posted by Mikhail Fursov <mi...@gmail.com>.

Hi zouqiong,

I looked through Chilimbi article and I think it contains some good points
on tracing of memory access. I going to read it more carefully this weekend
and will ready to discuss their algorithm in details next week.

Now I propose to set the goals we want to achieve more precisely.
Once you decide that PMU profiling must be used we have face with the
following problems:

1)  Neither Linux nor Windows systems do now allow to developer to access
PMU without kernel patch or driver-installation.
So even if our implementation is successful and included into harmony
distribution it won't be enabled by default.

2) It's hard to maintain all of the CPU families. E.g. I think that we
should select P4 and newer architectures. I mean the implementation will be
easier if we have PEBS support in CPU.

3) I do not  know any open-source library that supports both Linux and
Windows and P4 PMU events. I'll try to find it out but I'm afraid we have to
limit our implementation only for Linux today.

If these limitation are OK we can create a separate thread with subject
prefix [drlvm][dpgo] to discuss our plans and algorithms in details.
If limitations are too much and you expect to write an optimization that
will work on any platform, may be instrumentation based value profiling
might be a right choice.

?

-- 
Mikhail Fursov