You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@groovy.apache.org by MG <mg...@arscreat.com> on 2022/05/06 23:55:19 UTC

Groovy 3 -> 4 Performance Degradation Sample

Hi Groovy devs,

I have finally managed to extract a sample from our Groovy framework 
that only uses our most basic modules and still exhibits the 2-3x 
performance degradation between (non-indy) Groovy 3 and Groovy 4 I 
already described in multiple posts a while ago.
The code was built using the latest Groovy releases (3.0.10 / 4.0.1), 
you can find the 2 JARs & a test script at:

https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample

To check the performance:

 1. Open a GroovyConsole window for Groovy 3.0.10
 2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
 3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy

Analogue for Groovy 4.0.1 with groovyperformance-4.0.1.jar.

On 3 different PCs I consistently get results as shown below (each timed 
test step executes the same code 100x, each time creating the same 
simple SQL GString using our framework):

*Groovy 3.0.10*
0) dt = 7.99 ms
1) dt = 2.01 ms
2) dt = 1.53 ms
3) dt = 1.65 ms
4) dt = 1.36 ms

*Groovy 4.0.1*
0) dt = 16.51 ms
1) dt = 7.14 ms
2) dt = 5.83 ms
3) dt = 6.6 ms
4) dt = 6.24 ms

Throwing away the first loop, which includes framework setup time, 
Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my 
notebook the factor is closer to 2).
Execution times generally decrease when starting the script multiple 
times, but the best dt I have observed on the PC the above measurements 
were taken was 3.3ms, whereas Groovy 3.0.10 goes down below 1ms (In any 
case this is irrelevant for practical applications, since here a short, 
identical code segment will not be executed n x 500-times in a loop).

The performance degradation exhibited here is consistent with the 
performance of a) executing our test suite and b) startup & refresh 
cycle measurements I did in a web application based on our framwork 
code, as described before.

I did not get anywhere trying to use VisualVM trying to pinpoint where 
exactly performance was lost, apart from what I described in a previous 
post.

Groovy 3 exhibits the same performance degradation when used in indy 
mode, so it seems logical to assume that this mechanism is somehow to blame.
Based on this observation I tried to refactor parts of our framework 
which create a lot of new objects / do a lot of method calls, but even 
short-circuiting these did not lead to any relevant change in performance.

So this is hoping this finally supplies the information that leads to a 
solution to this problem by someone that has knowledge about the inner 
workings of Groovy & the JVM invokedynamic mechanism, and un-bars us 
from ever switching to Groovy 4  G-)

Cheers,
mg

Re: Groovy 3 -> 4 Performance Degradation - Clojure/Jython/JRuby/... ?

Posted by MG <mg...@arscreat.com>.

Another question would be, whether anyone here knows whether other 
dynamic JVM languages such as Clojure/Jython/JRuby/... are
a) using Indy calls, and
b) are or have been facing similar performance problems... ?

On 09/05/2022 19:53, Jochen Theodorou wrote:
> On 09.05.22 17:41, MG wrote:
>> Hi Jochen,
>>
>> since I am not feeling well right now, just some quick answers to some
>> of your questions:
>>
>>  1. I am not at liberty to release the source code from our project, not
>>     even partially, so these JARs are currently all I can provide. I
>>     felt when I reported this problem initially, there were doubts that
>>     it even exists, so after I could not pinpoint the problem on my
>>     side, I invested the time to extract these project parts, clean them
>>     from any historical references to other modules etc and supply them
>>     in compiled form.
>
> I think I can test if it is really my assumption or not. I made myself a
> little program, that will create a class with many methods and one more
> method, that is calling all of them, one-by-one. The methods are empty
> and nothing takes parameters. I then put this in the compiler and
> measure times for creating the class and running the methods.
>
> For 1000 Methods I get on my machine these results for Groovy-3.0.9
> non-Indy:
> 1 Class: 36ms
> 10 Classes: 114ms
> 100 Classes: 985ms
> 1000 Classes: 8227ms
>
> the last case means, that 1+ million methods are executed inside of the
> test class
>
> Then I repeat the test with Groovy-4.0.2 Indy:
> 1  Class: 137ms        (* 3,80)
> 10 Classes: 670ms      (* 5,88)
> 100 Classes: 3711ms    (* 3,77)
> 1000 Classes: 108235ms (* 13,16)
>
> I would not classify these numbers as a proper test, but it does hint at
> what I had assumed... creating the initial callsites in indy is too
> expensive
>
> the script I used:
>
>> def CLASSES = args[0] as Integer
>> def METHODS = args[1] as Integer
>>
>>
>> def methods = ""
>> def calls   = ""
>> METHODS.times { i ->
>>   methods += "\ndef foo$i(){}"
>>   calls   += "\nfoo$i()"
>> }
>>
>> def gcl = new GroovyClassLoader();
>> def testClasses = (1..CLASSES).collect{
>>     def testClass = """
>>     class MyPerfTest$it{
>>       $methods
>>       def run() {
>>         $calls
>>       }
>>     }
>>     """
>>     gcl.parseClass(testClass)
>> }
>>
>>
>> def time1 = System.nanoTime()
>> for (clazz :testClasses) {
>>   clazz.newInstance().run()
>> }
>> def time2 = System.nanoTime()
>> def timeDiff = (time2-time1)/1000000
>> println "Time = $timeDiff ms"
>
> While this is surely not "real world" my goal was testing the callsite
> creation, and I think the test does this well enough. Why exactly this
> scenario? Because I know indy can perform well in micro-benchmarks,
> where you crunch a number in tight loops. and if there is no problem
> there, then it must be with the many callsites you tend to visit only
> once - like here.
>
>>  2. As I have said in previous posts, the performance degradation occurs
>>     with non-Indy vs Indy. In fact this is how I could pinpoint that
>>     Indy was to blame: After realizing that our test suite ran for 3h
>>     under Groovy 4, instead of 1h under Groovy 3, and that this was not
>>     restricted to a few tests becoming really slow, but was a more or
>>     less general phenomenon (which was puzzling), I ran the test suite
>>     under Groovy 3 Indy (for the first time), and got basically the same
>>     reduced performance.
>
> I had a suspicion for a long time already actually. Bad since I sadly
> have barely any time for Groovy these days... and checking this can take
> very very long.
>
> [...]
>>  4. Throwing away the first test loop: It is a bit of an acdemic
>>     discusssion, since Groovy 4 performance is always 2 to 3 times
>>     slower in any case. You can execute the test script interleaved
>>     between Grooyv 3 / 4 multiple times to see how measurements develop,
>>     but I would say it is clear that Groovy 4 is always way slower than
>>     Groovy 3 for this test, and that is all that matters :-)
>
> in my little test factor 2-3 times is for me factor 3-4 and that was my
> best case.
>
> bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by MG <mg...@arscreat.com>.

Hi Jochen,

just a quick feedback: We use distinct classes derived from a Table base 
class to model concrete database tables (e.g. PERSON_Table, 
DEPARTMENT_Table, ADDRESS_Table, etc), but the different columns in 
those tables are not modelled as separate classes, but are just 
instances of the same Column class (or one of a few child classes that 
represent diffferent types of reference columns), e.g. it looks 
something like (actual code uses static imports / helper methods to be 
more compact / readable):

class PERSON_Table extends Table {
     @Lazy static final PERSON = new PERSON_Table('PERSON','pe')
     final ID = new Column(this, 'ID', SqlTypes.ID_DEFAULT, true)
     final DEPARTMENT_ID = new TableRefColumn(this, 
DEPARTMENT_Table.DEPARTMENT)
     final FIRST_NAME = new Column(this, 'FIRST_NAME ', 
SqlTypes.VARCHAR(64))
     final LAST_NAME = new Column(this, 'LAST_NAME ', SqlTypes.VARCHAR(128))
     final DATE_OF_BIRTH = new Column(this, 'DATE_OF_BIRTH ', SqlTypes.DATE)
     // ...
}

Cheers,
mg


On 10/05/2022 20:09, Jochen Theodorou wrote:
> On 09.05.22 22:13, MG wrote:
>> Hi Jochen,
>>
>> our code certainly creates a lot of (short lived) class instances*, but
>> we do not dynamically create new classes, so the overhead of creating
>> the initial Indy callsites should only be incurred once at the
>> beginning, and then should no longer play a role. So if the same SQL
>> creation code is executed in a loop as in the sample script I provided,
>> each loop after the first should be as fast as non-Indy Groovy, no ?
>
> You pay the cost every time the callsite gets invalidated. If you do
>
> def getMeta(table, column) {
>   table.getMetaData(column)
> }
>
> getMeta(PERSON, PERSON_NAME)
> getMeta(PERSON, PERSON_STREET)
>
> class PERSON extends Table {
> ...
>   class PERSON_NAME extends Column {...}
>   class PERSON_STREET extends Column {...}
> ...
> }
>
> then you have 3 callsites, but 4 inits, since the callsite in getMeta
> will be invalidated for the change from PERSON_NAME to PERSON_STREET.
> And if you do
>
> 10.times {
>   getMeta(PERSON, PERSON_NAME)
>   getMeta(PERSON, PERSON_STREET)
> }
>
> you get 20 callsite inits plus 19 callsite invalidations in getMetaClass.
>
> Even with caching.. in the current implementation we check only on the
> receiver, but here an argument is changing.
>
> If I rerun my test with 1 classes a 1 method and let this run for 500k
> times then I get
>
> Groovy 4 indy: 68-73
> Groovy 3 non-indy: 90-101
>
> Of course the less iterations the worse for indy:
>
> Groovy 4
>
> 1k: 1215
> 10k: 231
> 100k: 105
> 1m: 70
>
> Groovy 3
> 1k: 780-1000
> 10k: 169
> 100k: 96
> 1m: 90
>
> each run involves running the code 10 times, then measuring 10 times to
> really really trying to exclude any initial costs. And from that
> perspective Groovy 4 indy is better.
>
>
>> Do you think that a Groovy feature such as e.g. traits could be
>> responsible for the constant need for callsite creation ? A less-used
>> Groovy feature such as traits, that we use only in certain, more complex
>> scenarios, being responsible might explain why the performance problems
>> a) occur only in more complex test cases, and b) up to date no other
>> project seems to suffer the same problem as we do...
> [...]
>
> I do not remember anymore how they are compiled - but from what I
> remember I think there is a level of indirection involved, just don't
> remember if that call is in dynamic or static. If the first, then it
> might be
>
>
> bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by Jochen Theodorou <bl...@gmx.org>.

On 09.05.22 22:13, MG wrote:
> Hi Jochen,
>
> our code certainly creates a lot of (short lived) class instances*, but
> we do not dynamically create new classes, so the overhead of creating
> the initial Indy callsites should only be incurred once at the
> beginning, and then should no longer play a role. So if the same SQL
> creation code is executed in a loop as in the sample script I provided,
> each loop after the first should be as fast as non-Indy Groovy, no ?

You pay the cost every time the callsite gets invalidated. If you do

def getMeta(table, column) {
   table.getMetaData(column)
}

getMeta(PERSON, PERSON_NAME)
getMeta(PERSON, PERSON_STREET)

class PERSON extends Table {
...
   class PERSON_NAME extends Column {...}
   class PERSON_STREET extends Column {...}
...
}

then you have 3 callsites, but 4 inits, since the callsite in getMeta
will be invalidated for the change from PERSON_NAME to PERSON_STREET.
And if you do

10.times {
   getMeta(PERSON, PERSON_NAME)
   getMeta(PERSON, PERSON_STREET)
}

you get 20 callsite inits plus 19 callsite invalidations in getMetaClass.

Even with caching.. in the current implementation we check only on the
receiver, but here an argument is changing.

If I rerun my test with 1 classes a 1 method and let this run for 500k
times then I get

Groovy 4 indy: 68-73
Groovy 3 non-indy: 90-101

Of course the less iterations the worse for indy:

Groovy 4

1k: 1215
10k: 231
100k: 105
1m: 70

Groovy 3
1k: 780-1000
10k: 169
100k: 96
1m: 90

each run involves running the code 10 times, then measuring 10 times to
really really trying to exclude any initial costs. And from that
perspective Groovy 4 indy is better.

> Do you think that a Groovy feature such as e.g. traits could be
> responsible for the constant need for callsite creation ? A less-used
> Groovy feature such as traits, that we use only in certain, more complex
> scenarios, being responsible might explain why the performance problems
> a) occur only in more complex test cases, and b) up to date no other
> project seems to suffer the same problem as we do...
[...]

I do not remember anymore how they are compiled - but from what I
remember I think there is a level of indirection involved, just don't
remember if that call is in dynamic or static. If the first, then it
might be

bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by MG <mg...@arscreat.com>.

Hi Jochen,

our code certainly creates a lot of (short lived) class instances*, but 
we do not dynamically create new classes, so the overhead of creating 
the initial Indy callsites should only be incurred once at the 
beginning, and then should no longer play a role. So if the same SQL 
creation code is executed in a loop as in the sample script I provided, 
each loop after the first should be as fast as non-Indy Groovy, no ?

Do you think that a Groovy feature such as e.g. traits could be 
responsible for the constant need for callsite creation ? A less-used 
Groovy feature such as traits, that we use only in certain, more complex 
scenarios, being responsible might explain why the performance problems 
a) occur only in more complex test cases, and b) up to date no other 
project seems to suffer the same problem as we do...

(I just checked the effect of removing the Table.reference(...) calls 
from the script code, since these calls use reflection to create the 
correct Table instance, to see if that has any influence,  but this 
leads to the same amount of speedup for Groovy 3 and 4, so the 
performance degradation between Grooyv 3 and 4 stays the same at 3x)

Cheers,
mg

*Every database table is a Groovy class, with each of the table's 
columns represented by a Column class instance; every table references 
in a SQL expressions is another instatiation of the table's class and 
especially its Column objects (Column objects know their table 
reference, so they cannot be shared with the table root instance), and a 
lot of SQL creation code instantiates multiple new table references.


On 09/05/2022 19:53, Jochen Theodorou wrote:
> On 09.05.22 17:41, MG wrote:
>> Hi Jochen,
>>
>> since I am not feeling well right now, just some quick answers to some
>> of your questions:
>>
>>  1. I am not at liberty to release the source code from our project, not
>>     even partially, so these JARs are currently all I can provide. I
>>     felt when I reported this problem initially, there were doubts that
>>     it even exists, so after I could not pinpoint the problem on my
>>     side, I invested the time to extract these project parts, clean them
>>     from any historical references to other modules etc and supply them
>>     in compiled form.
>
> I think I can test if it is really my assumption or not. I made myself a
> little program, that will create a class with many methods and one more
> method, that is calling all of them, one-by-one. The methods are empty
> and nothing takes parameters. I then put this in the compiler and
> measure times for creating the class and running the methods.
>
> For 1000 Methods I get on my machine these results for Groovy-3.0.9
> non-Indy:
> 1 Class: 36ms
> 10 Classes: 114ms
> 100 Classes: 985ms
> 1000 Classes: 8227ms
>
> the last case means, that 1+ million methods are executed inside of the
> test class
>
> Then I repeat the test with Groovy-4.0.2 Indy:
> 1  Class: 137ms        (* 3,80)
> 10 Classes: 670ms      (* 5,88)
> 100 Classes: 3711ms    (* 3,77)
> 1000 Classes: 108235ms (* 13,16)
>
> I would not classify these numbers as a proper test, but it does hint at
> what I had assumed... creating the initial callsites in indy is too
> expensive
>
> the script I used:
>
>> def CLASSES = args[0] as Integer
>> def METHODS = args[1] as Integer
>>
>>
>> def methods = ""
>> def calls   = ""
>> METHODS.times { i ->
>>   methods += "\ndef foo$i(){}"
>>   calls   += "\nfoo$i()"
>> }
>>
>> def gcl = new GroovyClassLoader();
>> def testClasses = (1..CLASSES).collect{
>>     def testClass = """
>>     class MyPerfTest$it{
>>       $methods
>>       def run() {
>>         $calls
>>       }
>>     }
>>     """
>>     gcl.parseClass(testClass)
>> }
>>
>>
>> def time1 = System.nanoTime()
>> for (clazz :testClasses) {
>>   clazz.newInstance().run()
>> }
>> def time2 = System.nanoTime()
>> def timeDiff = (time2-time1)/1000000
>> println "Time = $timeDiff ms"
>
> While this is surely not "real world" my goal was testing the callsite
> creation, and I think the test does this well enough. Why exactly this
> scenario? Because I know indy can perform well in micro-benchmarks,
> where you crunch a number in tight loops. and if there is no problem
> there, then it must be with the many callsites you tend to visit only
> once - like here.
>
>>  2. As I have said in previous posts, the performance degradation occurs
>>     with non-Indy vs Indy. In fact this is how I could pinpoint that
>>     Indy was to blame: After realizing that our test suite ran for 3h
>>     under Groovy 4, instead of 1h under Groovy 3, and that this was not
>>     restricted to a few tests becoming really slow, but was a more or
>>     less general phenomenon (which was puzzling), I ran the test suite
>>     under Groovy 3 Indy (for the first time), and got basically the same
>>     reduced performance.
>
> I had a suspicion for a long time already actually. Bad since I sadly
> have barely any time for Groovy these days... and checking this can take
> very very long.
>
> [...]
>>  4. Throwing away the first test loop: It is a bit of an acdemic
>>     discusssion, since Groovy 4 performance is always 2 to 3 times
>>     slower in any case. You can execute the test script interleaved
>>     between Grooyv 3 / 4 multiple times to see how measurements develop,
>>     but I would say it is clear that Groovy 4 is always way slower than
>>     Groovy 3 for this test, and that is all that matters :-)
>
> in my little test factor 2-3 times is for me factor 3-4 and that was my
> best case.
>
> bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by Jochen Theodorou <bl...@gmx.org>.

On 09.05.22 17:41, MG wrote:
> Hi Jochen,
>
> since I am not feeling well right now, just some quick answers to some
> of your questions:
>
>  1. I am not at liberty to release the source code from our project, not
>     even partially, so these JARs are currently all I can provide. I
>     felt when I reported this problem initially, there were doubts that
>     it even exists, so after I could not pinpoint the problem on my
>     side, I invested the time to extract these project parts, clean them
>     from any historical references to other modules etc and supply them
>     in compiled form.

I think I can test if it is really my assumption or not. I made myself a
little program, that will create a class with many methods and one more
method, that is calling all of them, one-by-one. The methods are empty
and nothing takes parameters. I then put this in the compiler and
measure times for creating the class and running the methods.

For 1000 Methods I get on my machine these results for Groovy-3.0.9
non-Indy:
1 Class: 36ms
10 Classes: 114ms
100 Classes: 985ms
1000 Classes: 8227ms

the last case means, that 1+ million methods are executed inside of the
test class

Then I repeat the test with Groovy-4.0.2 Indy:
1  Class: 137ms        (* 3,80)
10 Classes: 670ms      (* 5,88)
100 Classes: 3711ms    (* 3,77)
1000 Classes: 108235ms (* 13,16)

I would not classify these numbers as a proper test, but it does hint at
what I had assumed... creating the initial callsites in indy is too
expensive

the script I used:

> def CLASSES = args[0] as Integer
> def METHODS = args[1] as Integer
>
>
> def methods = ""
> def calls   = ""
> METHODS.times { i ->
>   methods += "\ndef foo$i(){}"
>   calls   += "\nfoo$i()"
> }
>
> def gcl = new GroovyClassLoader();
> def testClasses = (1..CLASSES).collect{
> 	def testClass = """
> 	class MyPerfTest$it{
> 	  $methods
> 	  def run() {
> 	    $calls
> 	  }
> 	}
> 	"""
> 	gcl.parseClass(testClass)
> }
>
>
> def time1 = System.nanoTime()
> for (clazz :testClasses) {
>   clazz.newInstance().run()
> }
> def time2 = System.nanoTime()
> def timeDiff = (time2-time1)/1000000
> println "Time = $timeDiff ms"

While this is surely not "real world" my goal was testing the callsite
creation, and I think the test does this well enough. Why exactly this
scenario? Because I know indy can perform well in micro-benchmarks,
where you crunch a number in tight loops. and if there is no problem
there, then it must be with the many callsites you tend to visit only
once - like here.

>  2. As I have said in previous posts, the performance degradation occurs
>     with non-Indy vs Indy. In fact this is how I could pinpoint that
>     Indy was to blame: After realizing that our test suite ran for 3h
>     under Groovy 4, instead of 1h under Groovy 3, and that this was not
>     restricted to a few tests becoming really slow, but was a more or
>     less general phenomenon (which was puzzling), I ran the test suite
>     under Groovy 3 Indy (for the first time), and got basically the same
>     reduced performance.

I had a suspicion for a long time already actually. Bad since I sadly
have barely any time for Groovy these days... and checking this can take
very very long.

[...]
>  4. Throwing away the first test loop: It is a bit of an acdemic
>     discusssion, since Groovy 4 performance is always 2 to 3 times
>     slower in any case. You can execute the test script interleaved
>     between Grooyv 3 / 4 multiple times to see how measurements develop,
>     but I would say it is clear that Groovy 4 is always way slower than
>     Groovy 3 for this test, and that is all that matters :-)

in my little test factor 2-3 times is for me factor 3-4 and that was my
best case.

bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by MG <mg...@arscreat.com>.

Hi Jochen,

since I am not feeling well right now, just some quick answers to some 
of your questions:

 1. I am not at liberty to release the source code from our project, not
    even partially, so these JARs are currently all I can provide. I
    felt when I reported this problem initially, there were doubts that
    it even exists, so after I could not pinpoint the problem on my
    side, I invested the time to extract these project parts, clean them
    from any historical references to other modules etc and supply them
    in compiled form.
 2. As I have said in previous posts, the performance degradation occurs
    with non-Indy vs Indy. In fact this is how I could pinpoint that
    Indy was to blame: After realizing that our test suite ran for 3h
    under Groovy 4, instead of 1h under Groovy 3, and that this was not
    restricted to a few tests becoming really slow, but was a more or
    less general phenomenon (which was puzzling), I ran the test suite
    under Groovy 3 Indy (for the first time), and got basically the same
    reduced performance.
 3. This is not a micro benchmark simply because I was unable to create
    a small concise example that reproduces the problem. On the
    contrary, it always seemed that a large part of our framework needed
    to be involved for the performance degradation to show itself. That
    is why it took so long to produce a working sample, and why the test
    script contains some classes/traits/interfaces at the start: I had
    to reproduce somce code from the rest of our framework in the
    script, to reproduce one of the simpler tests which exhibited the
    problem.
 4. Throwing away the first test loop: It is a bit of an acdemic
    discusssion, since Groovy 4 performance is always 2 to 3 times
    slower in any case. You can execute the test script interleaved
    between Grooyv 3 / 4 multiple times to see how measurements develop,
    but I would say it is clear that Groovy 4 is always way slower than
    Groovy 3 for this test, and that is all that matters :-)

Cheers,
mg


On 08/05/2022 19:53, Jochen Theodorou wrote:
> On 07.05.22 01:55, MG wrote:
>> Hi Groovy devs,
>>
>> I have finally managed to extract a sample from our Groovy framework
>> that only uses our most basic modules and still exhibits the 2-3x
>> performance degradation between (non-indy) Groovy 3 and Groovy 4 I
>> already described in multiple posts a while ago.
>> The code was built using the latest Groovy releases (3.0.10 / 4.0.1),
>> you can find the 2 JARs & a test script at:
>>
>> https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample 
>>
>
> The problem is that the jars contain compiled Groovy code only, which
> means we do not know the source for them... makes it a bit harder to
> find out what is actually going wrong by trying to reproduce it
>
>> To check the performance:
>>
>>  1. Open a GroovyConsole window for Groovy 3.0.10
>>  2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
>>  3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
>>
>> Analogue for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
>>
>> On 3 different PCs I consistently get results as shown below (each timed
>> test step executes the same code 100x, each time creating the same
>> simple SQL GString using our framework):
>>
>> *Groovy 3.0.10*
>> 0) dt = 7.99 ms
>> 1) dt = 2.01 ms
>> 2) dt = 1.53 ms
>> 3) dt = 1.65 ms
>> 4) dt = 1.36 ms
>>
>> *Groovy 4.0.1*
>> 0) dt = 16.51 ms
>> 1) dt = 7.14 ms
>> 2) dt = 5.83 ms
>> 3) dt = 6.6 ms
>> 4) dt = 6.24 ms
>
> just a note, since you mention this later: This is Groovy 3 non-Indy vs.
> Groovy 4 (always indy)
>
>> Throwing away the first loop, which includes framework setup time,
>> Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my
>> notebook the factor is closer to 2).
>
> I am not sure that throwing out the first loop is actually right to do.
> As always with benchmarks, the problem is partially figuring out what
> you are really measuring. My assumption is that the initial callsite
> creation is much slower in indy, than it was with non-indy. If you have
> a lot of 1-time only invocations, this would pile up quite a bit.
>
>> Execution times generally decrease when starting the script multiple
>> times, but the best dt I have observed on the PC the above measurements
>> were taken was 3.3ms, whereas Groovy 3.0.10 goes down below 1ms (In any
>> case this is irrelevant for practical applications, since here a short,
>> identical code segment will not be executed n x 500-times in a loop).
> [...]
>
> You have to imagine it like this for non-indy...
> You do a method call, Groovy finds the method that is called and
> generates bytecode for a call to this method, guarded by some class
> (receiver and parameters) checks basically. Subsequent call are then
> done using this optimized path and all you pay compared to normal Java
> is the class checks, some special logic for category methods and like
> 2-3 method calls. All can be inlined pretty well. But won't with this
> low amount of calls. Still the path itself is pretty fast already on the
> second execution.
>
> The problem here is that the first execution is actually done from
> within the callsite code. This has to happen, as somebody has to do the
> invocation in the end. And even the nth invocation is done from Groovy
> created code. If you have to deal with the java module system, then this
> is a killer. But anyway. Let us note for the first time call performance:
>
> method selection + small (unverified) bytecode generation + actual
> invocation, with 2-3 method calls deepness and some type transformations
> before you get to the actual method
>
> With Indy you have the bootstrap method, which is called and here we do
> basically the method selection and produce method handles, method types,
> guards and type transformations. There is also some kind of cache....
>
> Now the question is, what is going wrong. The method selection is not
> the cost, since that is in both the same. Type transformations are done
> differently in both. In the callsite code we basically generate code for
> it, in indy we create a handle for it.
>
> Basically I see 2 basic possible troubles:
> 1) creating the initial selection is too expensive
> 2) the caching is not actually working
>
> The first one is one I suspect. In my opinion Indy is slower because of
> the type transformation code, because of how MethodTypes are produced
> for Indy, because an invocation of generated code is much faster than a
> method handle (which is kind of interpreted before Java creates bytecode
> for it). A bit slower would be no problem of course.
>
> I spend the whole Saturday looking at the code and trying to understand
> what is happening
>
> Well.. the structure seems to be that there is a size 4 LRU cache per
> callsite. But since the threshold is so high nothing is ever going to be
> cached. If I set the threshold to -1 (0) actually would not work, since
> there is a comparison with >, not >=. Every cache entry can be seen as a
> pair of something that will call the target and a fallback. The fallback
> seems to be to select the method again. Imho that should lead back to
> the cache, otherwise the cache makes no sense in a case of reselection
> of the cached method and basically degrades to a size 1 cache.
>
> But besides that... there is like a thousand callsites generated by your
> code - which does just mean, that this is no micro benchmark.
>
> Something does look strange though. For example the call to
> maxNameLength seems not to be properly cached. The cache stores as
> target selectMethod, every time, even if the threshold is set to 0. But
> then again it became 3:00 in the morning and I ran out of time
>
> As it should work imho:
>
> At first execution we build a handle, that can be called later on and
> store it in the cache. The fallback of this handle should be the cache.
> ASCII-Art for foo():
>
>>     handle
>>     /    \
>>   foo   cache
>
> The cache would need a key of the runtime type signature of the call.
> But since that could be expensive, the receiver alone would already help
> a little (that is what is done already). So if no old selection applies,
> a new method should be selected:
>
>>     handle
>>     /    \
>>   foo   cache
>>         /   \
>>       hit   miss
>>        |     |
>>       foo'  handle'
>
> or in some "simple" code:
>
>> const MH callCache = ...
>>
>> static initial run: install method selectMethodAndCreateCache(*) {
>>   MH target = selectMethod(*, fallback: callCache)
>>   CachedCallsite cs = new CachedCallSite()
>>   cs.install(target, receiverName)
>> }
>>
>> callCache(CachedCallsite cs, Object receiver, *) {
>>   target = cs.cache.findCachedMethod(receiver)
>>   if (target==null) {
>>     target=selectMethod(*)
>>     cs.install(target, receiverName(receiver))
>>   }
>>   target.invokeExact(*)
>> }
>
> and
>
>> findCachedMethod(receiver) {
>>   if (receiver of current target == receiver)
>>        disableCaching()
>>   return get(receiver)
>> }
>>
>> install(target, recceiverName) {
>>   super.setTarget(target)
>>   currentReceiverName = receiverName
>>   put(receiverName, target)
>> }
>
> The structure that my debugging suggests seems to be different from
> that. Furthermore...
>
> I tried a bunch of small improvements, that all did not really do a
> change, and would not be suitable for production (like disabling almost
> all guards), but it did not change the numbers significantly. I know
> this situation from back in 1.6 when I did the primitive optimizations.
> It was not until I could eliminate a call to synchronized to get better
> performance. In general concurrency-barriers are very bad for inlining.
> The cache code involves at least two of them, but not sure removing them
> would actually make things fast.
>
> The only thing I can imagine is building up the code slowly from "fast,
> but cannot much of anything" to "fully capable" and see at what point
> the performance breaks down. And there might be multiple such points.
>
> sadly I lack the time to do a project like that right now.
>
> bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by Jochen Theodorou <bl...@gmx.org>.

On 07.05.22 01:55, MG wrote:
> Hi Groovy devs,
>
> I have finally managed to extract a sample from our Groovy framework
> that only uses our most basic modules and still exhibits the 2-3x
> performance degradation between (non-indy) Groovy 3 and Groovy 4 I
> already described in multiple posts a while ago.
> The code was built using the latest Groovy releases (3.0.10 / 4.0.1),
> you can find the 2 JARs & a test script at:
>
> https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample

The problem is that the jars contain compiled Groovy code only, which
means we do not know the source for them... makes it a bit harder to
find out what is actually going wrong by trying to reproduce it

> To check the performance:
>
>  1. Open a GroovyConsole window for Groovy 3.0.10
>  2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
>  3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
>
> Analogue for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
>
> On 3 different PCs I consistently get results as shown below (each timed
> test step executes the same code 100x, each time creating the same
> simple SQL GString using our framework):
>
> *Groovy 3.0.10*
> 0) dt = 7.99 ms
> 1) dt = 2.01 ms
> 2) dt = 1.53 ms
> 3) dt = 1.65 ms
> 4) dt = 1.36 ms
>
> *Groovy 4.0.1*
> 0) dt = 16.51 ms
> 1) dt = 7.14 ms
> 2) dt = 5.83 ms
> 3) dt = 6.6 ms
> 4) dt = 6.24 ms

just a note, since you mention this later: This is Groovy 3 non-Indy vs.
Groovy 4 (always indy)

> Throwing away the first loop, which includes framework setup time,
> Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my
> notebook the factor is closer to 2).

I am not sure that throwing out the first loop is actually right to do.
As always with benchmarks, the problem is partially figuring out what
you are really measuring. My assumption is that the initial callsite
creation is much slower in indy, than it was with non-indy. If you have
a lot of 1-time only invocations, this would pile up quite a bit.

> Execution times generally decrease when starting the script multiple
> times, but the best dt I have observed on the PC the above measurements
> were taken was 3.3ms, whereas Groovy 3.0.10 goes down below 1ms (In any
> case this is irrelevant for practical applications, since here a short,
> identical code segment will not be executed n x 500-times in a loop).
[...]

You have to imagine it like this for non-indy...
You do a method call, Groovy finds the method that is called and
generates bytecode for a call to this method, guarded by some class
(receiver and parameters) checks basically. Subsequent call are then
done using this optimized path and all you pay compared to normal Java
is the class checks, some special logic for category methods and like
2-3 method calls. All can be inlined pretty well. But won't with this
low amount of calls. Still the path itself is pretty fast already on the
second execution.

The problem here is that the first execution is actually done from
within the callsite code. This has to happen, as somebody has to do the
invocation in the end. And even the nth invocation is done from Groovy
created code. If you have to deal with the java module system, then this
is a killer. But anyway. Let us note for the first time call performance:

method selection + small (unverified) bytecode generation + actual
invocation, with 2-3 method calls deepness and some type transformations
before you get to the actual method

With Indy you have the bootstrap method, which is called and here we do
basically the method selection and produce method handles, method types,
guards and type transformations. There is also some kind of cache....

Now the question is, what is going wrong. The method selection is not
the cost, since that is in both the same. Type transformations are done
differently in both. In the callsite code we basically generate code for
it, in indy we create a handle for it.

Basically I see 2 basic possible troubles:
1) creating the initial selection is too expensive
2) the caching is not actually working

The first one is one I suspect. In my opinion Indy is slower because of
the type transformation code, because of how MethodTypes are produced
for Indy, because an invocation of generated code is much faster than a
method handle (which is kind of interpreted before Java creates bytecode
for it). A bit slower would be no problem of course.

I spend the whole Saturday looking at the code and trying to understand
what is happening

Well.. the structure seems to be that there is a size 4 LRU cache per
callsite. But since the threshold is so high nothing is ever going to be
cached. If I set the threshold to -1 (0) actually would not work, since
there is a comparison with >, not >=. Every cache entry can be seen as a
pair of something that will call the target and a fallback. The fallback
seems to be to select the method again. Imho that should lead back to
the cache, otherwise the cache makes no sense in a case of reselection
of the cached method and basically degrades to a size 1 cache.

But besides that... there is like a thousand callsites generated by your
code - which does just mean, that this is no micro benchmark.

Something does look strange though. For example the call to
maxNameLength seems not to be properly cached. The cache stores as
target selectMethod, every time, even if the threshold is set to 0. But
then again it became 3:00 in the morning and I ran out of time

As it should work imho:

At first execution we build a handle, that can be called later on and
store it in the cache. The fallback of this handle should be the cache.
ASCII-Art for foo():

>     handle
>     /    \
>   foo   cache

The cache would need a key of the runtime type signature of the call.
But since that could be expensive, the receiver alone would already help
a little (that is what is done already). So if no old selection applies,
a new method should be selected:

>     handle
>     /    \
>   foo   cache
>         /   \
>       hit   miss
>        |     |
>       foo'  handle'

or in some "simple" code:

> const MH callCache = ...
>
> static initial run: install method selectMethodAndCreateCache(*) {
>   MH target = selectMethod(*, fallback: callCache)
>   CachedCallsite cs = new CachedCallSite()
>   cs.install(target, receiverName)
> }
>
> callCache(CachedCallsite cs, Object receiver, *) {
>   target = cs.cache.findCachedMethod(receiver)
>   if (target==null) {
>     target=selectMethod(*)
>     cs.install(target, receiverName(receiver))
>   }
>   target.invokeExact(*)
> }

and

> findCachedMethod(receiver) {
>   if (receiver of current target == receiver)
>        disableCaching()
>   return get(receiver)
> }
>
> install(target, recceiverName) {
>   super.setTarget(target)
>   currentReceiverName = receiverName
>   put(receiverName, target)
> }

The structure that my debugging suggests seems to be different from
that. Furthermore...

I tried a bunch of small improvements, that all did not really do a
change, and would not be suitable for production (like disabling almost
all guards), but it did not change the numbers significantly. I know
this situation from back in 1.6 when I did the primitive optimizations.
It was not until I could eliminate a call to synchronized to get better
performance. In general concurrency-barriers are very bad for inlining.
The cache code involves at least two of them, but not sure removing them
would actually make things fast.

The only thing I can imagine is building up the code slowly from "fast,
but cannot much of anything" to "fully capable" and see at what point
the performance breaks down. And there might be multiple such points.

sadly I lack the time to do a project like that right now.

bye Jochen

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by MG <mg...@arscreat.com>.

Hi Daniel,

could you confirm that with my sample ? I have tried all the 
optimization settings when you proposed them before 
("-Dgroovy.indy.optimize.threshold=0" threw an NPE, but 
"-Dgroovy.indy.optimize.threshold=2" should lead to the same result as 
far as I understand), and that alas did not lead to the performance of 
Groovy 3... (see performance tables in 
https://issues.apache.org/jira/browse/GROOVY-10307)
But I would be happy to learn I did something wrong, and this problem 
does indeed have such a simple solution, so it would be great if you 
could try your optimizations with my sample, to confirm that they indeed 
lead to the performance Groovy 3 exhibits :-)

Cheers,
mg


On 07/05/2022 18:12, Daniel Sun wrote:
> Hi mg,
>
>      Groovy 4 enables indy, i.e. invokedynamic by default, and the default threshold for optimization of indy is 10,000. We can adjust its value by the system property "groovy.indy.optimize.threshold", e.g. "-Dgroovy.indy.optimize.threshold=0". When the related code is executed more than threshold, the optimization will be applied, and the performance will be better.
>
>      BTW, the optimization is double-edged sword, which **may** result in de-opt storm, in other word, optimize -> de-optimize -> optimize -> de-optimize -> ... -> optimize. As we can see, the de-opt storm will result in very very poor performance. In order to solve the de-opt storm issue, the threshold for optimization is introduced and its default value is 10,000.
>
> Cheers,
> Daniel Sun
> On 2022/05/06 23:55:19 MG wrote:
>> Hi Groovy devs,
>>
>> I have finally managed to extract a sample from our Groovy framework
>> that only uses our most basic modules and still exhibits the 2-3x
>> performance degradation between (non-indy) Groovy 3 and Groovy 4 I
>> already described in multiple posts a while ago.
>> The code was built using the latest Groovy releases (3.0.10 / 4.0.1),
>> you can find the 2 JARs & a test script at:
>>
>> https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample
>>
>> To check the performance:
>>
>>   1. Open a GroovyConsole window for Groovy 3.0.10
>>   2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
>>   3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
>>
>> Analogue for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
>>
>> On 3 different PCs I consistently get results as shown below (each timed
>> test step executes the same code 100x, each time creating the same
>> simple SQL GString using our framework):
>>
>> *Groovy 3.0.10*
>> 0) dt = 7.99 ms
>> 1) dt = 2.01 ms
>> 2) dt = 1.53 ms
>> 3) dt = 1.65 ms
>> 4) dt = 1.36 ms
>>
>> *Groovy 4.0.1*
>> 0) dt = 16.51 ms
>> 1) dt = 7.14 ms
>> 2) dt = 5.83 ms
>> 3) dt = 6.6 ms
>> 4) dt = 6.24 ms
>>
>> Throwing away the first loop, which includes framework setup time,
>> Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my
>> notebook the factor is closer to 2).
>> Execution times generally decrease when starting the script multiple
>> times, but the best dt I have observed on the PC the above measurements
>> were taken was 3.3ms, whereas Groovy 3.0.10 goes down below 1ms (In any
>> case this is irrelevant for practical applications, since here a short,
>> identical code segment will not be executed n x 500-times in a loop).
>>
>> The performance degradation exhibited here is consistent with the
>> performance of a) executing our test suite and b) startup & refresh
>> cycle measurements I did in a web application based on our framwork
>> code, as described before.
>>
>> I did not get anywhere trying to use VisualVM trying to pinpoint where
>> exactly performance was lost, apart from what I described in a previous
>> post.
>>
>> Groovy 3 exhibits the same performance degradation when used in indy
>> mode, so it seems logical to assume that this mechanism is somehow to blame.
>> Based on this observation I tried to refactor parts of our framework
>> which create a lot of new objects / do a lot of method calls, but even
>> short-circuiting these did not lead to any relevant change in performance.
>>
>> So this is hoping this finally supplies the information that leads to a
>> solution to this problem by someone that has knowledge about the inner
>> workings of Groovy & the JVM invokedynamic mechanism, and un-bars us
>> from ever switching to Groovy 4  G-)
>>
>> Cheers,
>> mg
>>
>>
>>
>>
>>
>>
>>

Re: Groovy 3 -> 4 Performance Degradation Sample

Posted by Daniel Sun <su...@apache.org>.

Hi mg,

    Groovy 4 enables indy, i.e. invokedynamic by default, and the default threshold for optimization of indy is 10,000. We can adjust its value by the system property "groovy.indy.optimize.threshold", e.g. "-Dgroovy.indy.optimize.threshold=0". When the related code is executed more than threshold, the optimization will be applied, and the performance will be better.

    BTW, the optimization is double-edged sword, which **may** result in de-opt storm, in other word, optimize -> de-optimize -> optimize -> de-optimize -> ... -> optimize. As we can see, the de-opt storm will result in very very poor performance. In order to solve the de-opt storm issue, the threshold for optimization is introduced and its default value is 10,000.

Cheers,
Daniel Sun
On 2022/05/06 23:55:19 MG wrote:
> Hi Groovy devs,
> 
> I have finally managed to extract a sample from our Groovy framework 
> that only uses our most basic modules and still exhibits the 2-3x 
> performance degradation between (non-indy) Groovy 3 and Groovy 4 I 
> already described in multiple posts a while ago.
> The code was built using the latest Groovy releases (3.0.10 / 4.0.1), 
> you can find the 2 JARs & a test script at:
> 
> https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample
> 
> To check the performance:
> 
>  1. Open a GroovyConsole window for Groovy 3.0.10
>  2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
>  3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
> 
> Analogue for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
> 
> On 3 different PCs I consistently get results as shown below (each timed 
> test step executes the same code 100x, each time creating the same 
> simple SQL GString using our framework):
> 
> *Groovy 3.0.10*
> 0) dt = 7.99 ms
> 1) dt = 2.01 ms
> 2) dt = 1.53 ms
> 3) dt = 1.65 ms
> 4) dt = 1.36 ms
> 
> *Groovy 4.0.1*
> 0) dt = 16.51 ms
> 1) dt = 7.14 ms
> 2) dt = 5.83 ms
> 3) dt = 6.6 ms
> 4) dt = 6.24 ms
> 
> Throwing away the first loop, which includes framework setup time, 
> Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my 
> notebook the factor is closer to 2).
> Execution times generally decrease when starting the script multiple 
> times, but the best dt I have observed on the PC the above measurements 
> were taken was 3.3ms, whereas Groovy 3.0.10 goes down below 1ms (In any 
> case this is irrelevant for practical applications, since here a short, 
> identical code segment will not be executed n x 500-times in a loop).
> 
> The performance degradation exhibited here is consistent with the 
> performance of a) executing our test suite and b) startup & refresh 
> cycle measurements I did in a web application based on our framwork 
> code, as described before.
> 
> I did not get anywhere trying to use VisualVM trying to pinpoint where 
> exactly performance was lost, apart from what I described in a previous 
> post.
> 
> Groovy 3 exhibits the same performance degradation when used in indy 
> mode, so it seems logical to assume that this mechanism is somehow to blame.
> Based on this observation I tried to refactor parts of our framework 
> which create a lot of new objects / do a lot of method calls, but even 
> short-circuiting these did not lead to any relevant change in performance.
> 
> So this is hoping this finally supplies the information that leads to a 
> solution to this problem by someone that has knowledge about the inner 
> workings of Groovy & the JVM invokedynamic mechanism, and un-bars us 
> from ever switching to Groovy 4  G-)
> 
> Cheers,
> mg
> 
> 
> 
> 
> 
> 
>