Posted to dev@harmony.apache.org by sh...@computer.org on 2005/05/23 01:48:54 UTC

Re: Other interesting papers and research

From: Steve Blackburn <St...@anu.edu.au>

> > acoliver@apache.org wrote:
> >
> >> The approach of using C compiler generated code rather than writing a
> >> full compiler appeals to me:
> >> http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf
> >>
> >> I am curious how well the approach performs compared to existing
> >> JITs.

> They automatically build themselves
> simple JIT backends (by extracting fragments produced by the
> ahead-of-time compiler).  This sounds like a great way to achieve
> portability while achieving better performance than a conventional
> interpreter.

I would guess it is only a bit better than, or just comparable to, a good interpreter.

In 1998, I wrote such a JIT compiler, which concatenated code fragments
generated by GCC for each JVM instruction. Unfortunately, the JIT was
slightly slower than the interpreter in Sun's Classic VM. That
interpreter was written in x86 assembly language and implemented
dynamic stack caching with 2 registers and 3 states. It performed much
better than the previous interpreter written in C.

Then I rewrote the JIT.
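
To make the technique concrete: the JIT output for a bytecode sequence
is, in effect, the GCC-compiled handler bodies pasted end to end, with
the dispatch between them removed.  A minimal sketch of that
equivalence in C (hypothetical names, not the actual shuJIT code; a
real JIT copies the machine code fragments themselves, patches their
operands and flushes the instruction cache):

  /* The macros stand for the per-instruction machine code fragments
   * that GCC generated; the "jitted" method is those fragments pasted
   * back to back, with no dispatch in between. */
  #include <stdio.h>

  static int stack[64];
  static int *sp = stack;                    /* operand stack pointer */

  #define ICONST(c) (*sp++ = (c))            /* fragment for iconst_<c> */
  #define IADD()    (sp--, sp[-1] += sp[0])  /* fragment for iadd */

  /* JIT output for the bytecode "iconst_2 iconst_3 iadd": */
  static void jitted_method(void)
  {
      ICONST(2);
      ICONST(3);
      IADD();
  }

  int main(void)
  {
      jitted_method();
      printf("%d\n", sp[-1]);                /* prints 5 */
      return 0;
  }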

I am not sure which is better for us: a portable but so-so
baseline compiler, or a good interpreter which is possibly less
portable than the compiler.  There will be trade-offs between memory
consumption, portability and so on.

  Kazuyuki Shudo	shudo@computer.org	http://www.shudo.net/

Re: Other interesting papers and research

Posted by Davanum Srinivas <da...@gmail.com>.
Found a paper from David too...
http://www.research.ibm.com/people/d/dgrove/papers/cgo05.html

On 6/6/05, shudo@computer.org <sh...@computer.org> wrote:
> Hi Rob,
> 
> > From: Robert Lougher <ro...@gmail.com>
> > Date: Mon, 6 Jun 2005 14:58:45 +0100
> 
> > > > One thing to
> > > > note is that a threaded interpreter would see something like a 2-4x
> > > > expansion over "normal" bytecodes when it converts from bytecodes to its
> > > > internal form (arrays of function pointers).
> > >
> > > Direct threading interpreters like the JDK's work on plain Java
> > > bytecode; they do not need to expand normal bytecode instructions.
> > > Such expansion might be required if Java bytecode were not linear
> > > but rather a tree or some other complicated form.
> >
> > According to my understanding, an indirect threaded interpreter uses
> > the original bytecode stream.  It's indirect because the handler
> > address must be looked up via the bytecode.
> 
> Ah, thanks for pointing that out.
> My wording 'direct threading' was not correct.
> 
> The threading (interpreting) technique I referred to, as implemented
> in JDKs, should be called 'token threading', neither direct nor
> indirect threading, because it works directly on bytecode
> instructions without any expansion.  Note that the interpreter
> provides a NEXT routine for each native code fragment corresponding
> to a VM instruction.  For the JVM, such 'threading' terminology is
> not very informative, because direct interpretation of portable
> bytecode is naturally 'token threading'.
> 
> Dave's last posting was based on the direct threading technique, and
> what he said was correct for direct threading, but my earlier posting
> was mistaken.
> 
>   Kazuyuki Shudo        shudo@computer.org      http://www.shudo.net/
> 


-- 
Davanum Srinivas - http://webservices.apache.org/~dims/

Re: Other interesting papers and research

Posted by sh...@computer.org.
Hi Rob,

> From: Robert Lougher <ro...@gmail.com>
> Date: Mon, 6 Jun 2005 14:58:45 +0100

> > > One thing to
> > > note is that a threaded interpreter would see something like a 2-4x
> > > expansion over "normal" bytecodes when it converts from bytecodes to its
> > > internal form (arrays of function pointers).
> >
> > Direct threading interpreters like the JDK's work on plain Java
> > bytecode; they do not need to expand normal bytecode instructions.
> > Such expansion might be required if Java bytecode were not linear
> > but rather a tree or some other complicated form.
>
> According to my understanding, an indirect threaded interpreter uses
> the original bytecode stream.  It's indirect because the handler
> address must be looked up via the bytecode.

Ah, thanks for pointing that out.
My wording 'direct threading' was not correct.

The threading (interpreting) technique I referred to, as implemented
in JDKs, should be called 'token threading', neither direct nor
indirect threading, because it works directly on bytecode
instructions without any expansion.  Note that the interpreter
provides a NEXT routine for each native code fragment corresponding
to a VM instruction.  For the JVM, such 'threading' terminology is
not very informative, because direct interpretation of portable
bytecode is naturally 'token threading'.
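
For concreteness, a hypothetical sketch of such token threading, using
GCC's computed goto (this is not the actual JDK source): dispatch is
indexed by the bytecode itself, so the bytecode needs no expansion,
and every handler ends with its own copy of NEXT.

  #include <stdio.h>

  enum { ICONST_1, ICONST_2, IADD, PRINT, HALT };

  static void run(const unsigned char *code)
  {
      /* One label per opcode; the token (bytecode) indexes the table. */
      static void *table[] =
          { &&iconst_1, &&iconst_2, &&iadd, &&print, &&halt };
      int stack[16], *sp = stack;
      const unsigned char *pc = code;

  #define NEXT goto *table[*pc++]    /* replicated after each handler */

      NEXT;
  iconst_1: *sp++ = 1;               NEXT;
  iconst_2: *sp++ = 2;               NEXT;
  iadd:     sp--, sp[-1] += sp[0];   NEXT;
  print:    printf("%d\n", sp[-1]);  NEXT;
  halt:     return;
  #undef NEXT
  }

  int main(void)
  {
      const unsigned char code[] = { ICONST_1, ICONST_2, IADD, PRINT, HALT };
      run(code);                     /* prints 3 */
      return 0;
  }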

Dave's last posting was based on the direct threading technique, and
what he said was correct for direct threading, but my earlier posting
was mistaken.

  Kazuyuki Shudo	shudo@computer.org	http://www.shudo.net/

Re: Other interesting papers and research

Posted by Robert Lougher <ro...@gmail.com>.
Hi,

On 6/6/05, shudo@computer.org <sh...@computer.org> wrote:
> Hi Dave,
> 
> > From: David P Grove <gr...@us.ibm.com>
> >
> > One thing to
> > note is that a threaded interpreter would see something like a 2-4x
> > expansion over "normal" bytecodes when it converts from bytecodes to its
> > internal form (arrays of function pointers).
> 
> Direct threading interpreters like the JDK's work on plain Java
> bytecode; they do not need to expand normal bytecode instructions.
> Such expansion might be required if Java bytecode were not linear
> but rather a tree or some other complicated form.
> 

According to my understanding, an indirect threaded interpreter uses
the original bytecode stream.  It's indirect because the handler
address must be looked up via the bytecode.  A direct threaded
interpreter removes this step by placing the actual handler addresses
in the rewritten instruction stream.
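
To make the distinction concrete, here is a hypothetical sketch of the
direct threading rewrite, again with GCC's computed goto (not JamVM's
actual code): a one-time pass replaces each opcode with its handler's
address, so dispatch no longer needs a table lookup, at the cost of
pointer-sized instructions.

  #include <stdio.h>

  enum { ICONST_1, ICONST_2, IADD, PRINT, HALT, NUM_OPS };

  static void run(const unsigned char *code, int len)
  {
      static void *handler[NUM_OPS];
      if (!handler[0]) {             /* record handler addresses once */
          handler[ICONST_1] = &&iconst_1;
          handler[ICONST_2] = &&iconst_2;
          handler[IADD]     = &&iadd;
          handler[PRINT]    = &&print;
          handler[HALT]     = &&halt;
      }

      /* Rewriting step: every 1-byte opcode becomes a pointer-sized
       * handler address -- the source of the code-size expansion. */
      void *threaded[64];
      for (int i = 0; i < len; i++)
          threaded[i] = handler[code[i]];

      int stack[16], *sp = stack;
      void **tpc = threaded;

  #define NEXT goto **tpc++          /* the address is already inline */

      NEXT;
  iconst_1: *sp++ = 1;               NEXT;
  iconst_2: *sp++ = 2;               NEXT;
  iadd:     sp--, sp[-1] += sp[0];   NEXT;
  print:    printf("%d\n", sp[-1]);  NEXT;
  halt:     return;
  #undef NEXT
  }

  int main(void)
  {
      const unsigned char code[] = { ICONST_1, ICONST_2, IADD, PRINT, HALT };
      run(code, (int)sizeof code);   /* prints 3 */
      return 0;
  }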

For what it's worth, JamVM supports both direct and indirect
threading, with static and dynamic stack-caching respectively.  I seem
to recall an average code size increase of ~4x for JamVM's internal
instruction format but I'll need to recheck my figures to be sure. 
Note that this assumes a 32-bit architecture.  Handler addresses will
be twice as large on a 64-bit machine, and the code increase over
bytecodes therefore larger.

Rob.

Re: Other interesting papers and research

Posted by sh...@computer.org.
Hi Dave,

> From: David P Grove <gr...@us.ibm.com>

> shudo@computer.org wrote on 06/05/2005 10:48:29 PM:
>
> > - The machine code concatenating technique consumes much memory.
> >   In my experience, the generated machine code is about 10 times
> >   larger than the original instructions in Java bytecode.
> >
> > In the paper, the authors do not mention the memory consumption of
> > the technique.  We cannot guess precisely how much it is, but it
> > could be a big drawback.  Yes, we can say the same about the approach
> > of taking a baseline compiler instead of an interpreter (as in Jikes
> > RVM).  The memory consumption of the Jikes RVM baseline compiler is
> > very interesting.
>
> > It's platform dependent of course, but on IA32 it isn't too horrible.  For
> example, running SPECjvm98 we see a 6.23x expansion from the Jikes RVM
> baseline compiler machine code bytes over bytecode bytes.

Thanks for giving us such a useful number.
It looks reasonable.

> One thing to
> note is that a threaded interpreter would see something like a 2-4x
> expansion over "normal" bytecodes when it converts from bytecodes to its
> internal form (arrays of function pointers).

Direct threading interpreters like the JDK's work on plain Java
bytecode; they do not need to expand normal bytecode instructions.
Such expansion might be required if Java bytecode were not linear
but rather a tree or some other complicated form.

Then,

> So, a 6x expansion is
> probably only roughly 2x worse than some interpreted systems would
> actually see in practice.

We should just say that the baseline compiler of Jikes RVM generates
native code 6x larger than the original bytecode instructions.


For a JVM written in Java, it seems natural to have a baseline
compiler instead of an interpreter.

Having an interpreter in a Java-written JVM looks complicated.  We
would hope that the architecture of a JVM (e.g. interpreter vs.
baseline compiler) is independent of the language used to implement
a given part of the JVM, but there seems to be a dependency between
them.  Any comments?

  Kazuyuki Shudo	shudo@computer.org	http://www.shudo.net/

Re: Other interesting papers and research

Posted by David P Grove <gr...@us.ibm.com>.
shudo@computer.org wrote on 06/05/2005 10:48:29 PM:


> - The machine code concatenating technique consumes much memory.
>   In my experience, the generated machine code is about 10 times
>   larger than the original instructions in Java bytecode.
> 
> In the paper, the authors do not mention the memory consumption of
> the technique.  We cannot guess precisely how much it is, but it
> could be a big drawback.  Yes, we can say the same about the approach
> of taking a baseline compiler instead of an interpreter (as in Jikes
> RVM).  The memory consumption of the Jikes RVM baseline compiler is
> very interesting.

It's platform dependent of course, but on IA32 it isn't too horrible.  For 
example, running SPECjvm98 we see a 6.23x expansion from the Jikes RVM 
baseline compiler machine code bytes over bytecode bytes.  One thing to 
note is that a threaded interpreter would see something like a 2-4x 
expansion over "normal" bytecodes when it converts from bytecodes to its 
internal form (arrays of function pointers). So, a 6x expansion is 
probably only roughly 2x worse than some interpreted systems would 
actually see in practice. You can get this data out of Jikes RVM for your 
platform/program with -X:vm:measureCompilation=true.
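
As a rough back-of-the-envelope for that 2-4x figure, assuming a
hypothetical internal form with 32-bit pointers (not Jikes RVM's
actual layout):

  iload_1    1 byte  -> 1 handler pointer                = 4 bytes  (4.0x)
  getfield   3 bytes -> 1 handler pointer + operand word = 8 bytes  (~2.7x)

Opcode-dense code lands near 4x and operand-heavy code nearer 2x, and
the handler pointers double in size on a 64-bit machine.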

--dave



Re: Other interesting papers and research

Posted by sh...@computer.org.
Hi Steve and all,

| The approach of using C compiler generated code rather than writing a
| full compiler appeals to me:
| http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

> From: Steve Blackburn <St...@anu.edu.au>
> Date: Tue, 24 May 2005 21:08:05 +1000

> >>They automatically build themselves
> >>simple JIT backends (by extracting fragments produced by the
> >>ahead-of-time compiler).  This sounds like a great way to achieve
> >>portability while achieving better performance than a conventional
> >>interpreter.

> >I would guess it is only a bit better than, or just comparable to, a good interpreter.
>
> They say it is a lot better: "speedups of up to 1.87 over the fastest
> previous interpreter based technique, and performance comparable to
> simple native code compilers."

Their technique may be reasonable in some cases,
but it is better to be aware of the following:

- On an Athlon processor, the speedups (gforth-native over gforth-super)
  are smaller than those on PowerPC: between 1.06 and 1.49.

- The machine code concatenating technique consumes much memory.
  In my experience, the generated machine code is about 10 times
  larger than the original instructions in Java bytecode.

In the paper, the authors do not mention the memory consumption of
the technique.  We cannot guess precisely how much it is, but it
could be a big drawback.  Yes, we can say the same about the approach
of taking a baseline compiler instead of an interpreter (as in Jikes
RVM).  The memory consumption of the Jikes RVM baseline compiler is
very interesting.

And the next point does not reduce the value of the technique, but it
is better to know:

- The compared interpreter (gforth-super) could be improved
  by other techniques including stack caching.
  This means that their machine code concatenating technique may
  also benefit from those remaining techniques.


> >In 1998, I wrote such a JIT compiler, which concatenated code fragments
> >generated by GCC for each JVM instruction.
>
> Very interesting!
>
> >Unfortunately, the JIT was
> >slightly slower than the interpreter in Sun's Classic VM. That
> >interpreter was written in x86 assembly language and implemented
> >dynamic stack caching with 2 registers and 3 states. It performed much
> >better than the previous interpreter written in C.
> >
> >Then I rewrote the JIT.

In reality, my old JIT compiler in C cannot be strictly compared to
the JDK interpreter in assembly language.

Let me try to clarify the situation.

                               inst. execution            stack handling
(1)Ertl&Gregg's gforth-super   direct threading      <2>  TOS in a register <2>
(2)Ertl&Gregg's gforth-native  concat'd machine code <1>  TOS in a register <2>
(3)JDK interpreter in C        switch threading      <3>  memory            <3>
(4)JDK interpreter in asm      direct threading      <2>  stack caching     <1>
(5)Shudo's first JIT           concat'd machine code <1>  memory            <3>
(6)Shudo's lightweight JIT     concat'd machine code <1>  stack caching     <1>

Gforth-super (1) is Ertl & Gregg's fastest interpreter, against which
the machine code concatenating technique (2) was compared.
(3) is the only interpreter that the ancient JDK 1.0.2 had.
(4) is the interpreter of JDK 1.1, written in assembly language.
(5) is my first JIT compiler, which concatenated machine code
for each VM instruction.
(6) is shuJIT, the JIT compiler rewritten after (5).

<1>-<3> indicate the order of expected performance.
The machine code concatenation proposed in Ertl & Gregg's paper is
marked <1> because it is the best in this table for VM instruction
execution.
(4) and (6) use 2 machine registers for stack caching, while
Ertl & Gregg's implementations use one register
to cache the top of stack (TOS).

Ertl & Gregg's paper says that gforth-native (2) provided speedups of
up to 1.87 over gforth-super (1).

I wrote in a previous posting that (5) was slower than (4), but (5)
was also inferior to (4) in stack handling.  That may well be the
reason for the slowness, so I cannot say that the machine code
concatenating technique is necessarily slower than a good interpreter
like (4).


The following are other facts from my experience:

- Shudo's lightweight JIT (6) generates native code about 10 times
  larger than the original Java bytecode instructions.
  This could be a drawback of an approach that takes a baseline
  compiler instead of an interpreter.

  Does Ertl & Gregg's machine code concatenating technique suffer
  from this?  How about the baseline compiler of Jikes RVM?

- The JDK 1.1 interpreter written in assembly (4) executed a
  numerical benchmark, Linpack, about twice as fast as the previous
  interpreter written in C (3).

  A lesson here is that interpreters are often regarded as uniformly
  slow, but a fast one and a slow one differ greatly in performance.

- Register utilization is still important even for an interpreter.
  Interpreters (1) and (4), and lightweight JIT compilers (2) and (6),
  utilize one or two machine registers to cache values around the TOS.
  It is also important to map VM registers (e.g. the PC) onto machine
  registers.  See the sketch after this list.

  This means a good interpreter cannot be implemented in a completely
  portable way.

  Note that I do not know how Ertl & Gregg's implementations (1) and
  (2) keep the TOS in a machine register.  I guess they have a little
  assembly code.
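
Here is that sketch: a hypothetical one-register TOS cache in GNU C
(not the actual JDK or gforth code).  The top of stack lives in a
local variable that the C compiler can keep in a machine register, so
e.g. iadd makes one memory access instead of three; the JDK 1.1
interpreter's dynamic scheme, with 2 registers and 3 states, extends
this idea.

  #include <stdio.h>

  enum { ICONST_1, ICONST_2, IADD, PRINT, HALT };

  static void run(const unsigned char *code)
  {
      static void *table[] =
          { &&iconst_1, &&iconst_2, &&iadd, &&print, &&halt };
      int stack[16], *sp = stack;    /* slots below the cached TOS */
      int tos = 0;                   /* top of stack, cached in a register;
                                        a dummy until something is pushed */
      const unsigned char *pc = code;

  #define NEXT goto *table[*pc++]

      NEXT;
  iconst_1: *sp++ = tos; tos = 1;    NEXT;  /* spill old TOS, load new */
  iconst_2: *sp++ = tos; tos = 2;    NEXT;
  iadd:     tos += *--sp;            NEXT;  /* one memory read, no write */
  print:    printf("%d\n", tos);     NEXT;
  halt:     return;
  #undef NEXT
  }

  int main(void)
  {
      const unsigned char code[] = { ICONST_1, ICONST_2, IADD, PRINT, HALT };
      run(code);                     /* prints 3 */
      return 0;
  }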


  Kazuyuki Shudo	shudo@computer.org	http://www.shudo.net/

Re: Other interesting papers and research

Posted by Steve Blackburn <St...@anu.edu.au>.
shudo@computer.org wrote:

>>They automatically build themselves
>>simple JIT backends (by extracting fragments produced by the
>>ahead-of-time compiler).  This sounds like a great way to achieve
>>portability while achieving better performance than a conventional
>>interpreter.
>>    
>>
>
>I would guess it is only a bit better than, or just comparable to, a good interpreter.
>  
>
They say it is a lot better: "speedups of up to 1.87 over the fastest 
previous interpreter based technique, and performance comparable to 
simple native code compilers. The effort required for retargeting our 
implementation from the 386 to the PPC architecture was less than a 
person day."

>In 1998, I wrote such a JIT compiler, which concatenated code fragments
>generated by GCC for each JVM instruction.
>
Very interesting!

>Unfortunately, the JIT was
>slightly slower than the interpreter in Sun's Classic VM. That
>interpreter was written in x86 assembly language and implemented
>dynamic stack caching with 2 registers and 3 states. It performed much
>better than the previous interpreter written in C.
>
>Then I rewrote the JIT.
>  
>
It would be interesting to hear your perspective on Ertl & Gregg's 
approach.  Did they do something you had not done?  Do they have any 
particular insight?  You are in an excellent position to make a critical 
assessment of their work.

>I am not sure which is better for us: a portable but so-so
>baseline compiler, or a good interpreter which is possibly less
>portable than the compiler.  There will be trade-offs between memory
>consumption, portability and so on.
>  
>
Ideally we will have both as components, and the capacity to choose
either or both depending on the build we are targeting.

Cheers,

--Steve