Posted to reviews@spark.apache.org by bdrillard <gi...@git.apache.org> on 2017/10/17 18:20:08 UTC
[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
GitHub user bdrillard opened a pull request:
https://github.com/apache/spark/pull/19518
[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State Compaction
## What changes were proposed in this pull request?
This PR is the part-two follow-up to #18075, meant to address [SPARK-18016](https://issues.apache.org/jira/browse/SPARK-18016), Constant Pool limit exceptions. Part 1 implemented `NestedClass` code splitting, in which excess code was split off into nested private subclasses of the `OuterClass`. Part 2 addresses excess mutable state, where the number of inlined variables declared at the top of the `OuterClass` can also exceed the constant pool limit.
Here, we modify the `addMutableState` function in the `CodeGenerator` to check whether the declared state can be compacted into an array and initialized in loops rather than inlined and initialized with its own line of code. We identify four types of state that can be compacted:
* Primitive state (ints, booleans, etc.)
* Object state of like-type without any initial assignment
* Object state of like-type initialized to `null`
* Object state of like-type initialized to the type's base (no-argument) constructor
With mutable state compaction, at the top of the class we generate array declarations like:
```
private Object[] references;
private UnsafeRow result;
private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
...
private boolean[] mutableStateArray1 = new boolean[12507];
private InternalRow[] mutableStateArray4 = new InternalRow[5268];
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] mutableStateArray5 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[7663];
private java.lang.String[] mutableStateArray2 = new java.lang.String[12477];
private int[] mutableStateArray = new int[42509];
private java.lang.Object[] mutableStateArray6 = new java.lang.Object[30];
private boolean[] mutableStateArray3 = new boolean[10536];
```
and these arrays are initialized in loops as:
```
private void init_3485() {
  for (int i = 0; i < mutableStateArray3.length; i++) {
    mutableStateArray3[i] = false;
  }
}
```
For compacted mutable state, `addMutableState` returns an array accessor value, which is then referenced in the subsequent generated code.
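As an illustration, the accessor bookkeeping can be sketched as follows. The class and method names here are hypothetical, not the actual Spark `CodegenContext` API:

```java
// Minimal sketch of state compaction: instead of one field per piece of
// mutable state, like-typed state shares a single array, and registering
// state returns an array accessor such as "mutableStateArray[3]" that the
// generated code then references in place of a field name.
class StateCompactor {
    private int intSlots = 0;

    // Register one piece of int state; return the code used to access it.
    String addMutableIntState() {
        return "mutableStateArray[" + (intSlots++) + "]";
    }

    // Emit the single array declaration covering all registered int state.
    String declaration() {
        return "private int[] mutableStateArray = new int[" + intSlots + "];";
    }

    public static void main(String[] args) {
        StateCompactor ctx = new StateCompactor();
        System.out.println(ctx.addMutableIntState()); // mutableStateArray[0]
        System.out.println(ctx.addMutableIntState()); // mutableStateArray[1]
        System.out.println(ctx.declaration());
    }
}
```

However many pieces of int state are registered, the outer class pays the constant pool cost of one array field.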
**Note**: some state cannot be easily compacted (without perhaps deeper changes to code generation), as some state value names are taken for granted at the global level during code generation (see `CatalystToExternalMap` in `Objects` as an example). For this state, we provide an `inline` hint to the function call, which indicates that the state should be inlined in the `OuterClass`. Still, the state we can easily compact reduces the Constant Pool to a tractable size for the wide/deeply nested schemas I was able to test against.
## How was this patch tested?
Tested against several complex schema types; also added a test case generating 40,000 string columns and creating the `UnsafeProjection`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bdrillard/spark state_compaction
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19518.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19518
----
commit 081bc5de6ee55e00ff58c4abddc347f77c29d4aa
Author: ALeksander Eskilson <al...@cerner.com>
Date: 2017-10-17T14:06:12Z
adding state compaction
commit e7046c3d3bb528f18b3183d81e8bc26720a8baf7
Author: ALeksander Eskilson <al...@cerner.com>
Date: 2017-10-17T16:54:54Z
adding inline changes
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
| what do you mean by this? using array can't reduce constant pool size at all?
Not at all. However, when the array index for an access is 32768 or greater, the access requires a constant pool entry. This is because an integer constant of 32768 or greater uses the `ldc` java bytecode instruction [[ref]](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ldc)[[ref]](https://cs.au.dk/~mis/dOvs/jvmspec/ref-_ldc.html).
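The operand ranges behind this can be sketched as a small classifier (illustrative only; which instruction a compiler actually emits is compiler-dependent):

```java
// Which JVM instruction a compiler typically uses to load an int constant.
// Only ldc consumes a constant pool entry (a CONSTANT_Integer), which is
// why array indices of 32768 and above start costing constant pool space.
class ConstLoad {
    static String instructionFor(int value) {
        if (value >= -1 && value <= 5) return "iconst_<n>";                        // no operand
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return "bipush";   // 1-byte operand
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) return "sipush"; // 2-byte operand
        return "ldc"; // needs a constant pool entry
    }

    public static void main(String[] args) {
        System.out.println(instructionFor(4));      // iconst_<n>
        System.out.println(instructionFor(100));    // bipush
        System.out.println(instructionFor(12345));  // sipush
        System.out.println(instructionFor(32768));  // ldc
    }
}
```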
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays. In summary, the following are the performance results (**smaller is better**).
- 0.65: flat global variables
- 0.91: inner global variables
- 1: array
WDYT? Any comments are very appreciated.
Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && javac Test.java && java Test
Result(us): Array
0: 462848.969
1: 461978.693
2: 463174.459
3: 461422.763
4: 460563.915
5: 460112.262
6: 460059.957
7: 460376.230
8: 460245.445
9: 460308.775
10: 460154.955
11: 460005.629
12: 460330.584
13: 460277.612
14: 460181.360
15: 460168.843
16: 459790.137
17: 460248.481
18: 460344.471
19: 460084.529
20: 459987.263
21: 459961.639
22: 459952.447
23: 460128.518
24: 460025.783
25: 459874.303
26: 459932.685
27: 460065.736
28: 459954.526
29: 459972.679
BEST: 459790.137000, AVG: 460417.788
Result(us): InnerVars
0: 421013.480
1: 420279.235
2: 419366.157
3: 421015.934
4: 419540.049
5: 420316.650
6: 419816.612
7: 420211.140
8: 420215.864
9: 421104.657
10: 421836.430
11: 420866.894
12: 421457.850
13: 421734.506
14: 420796.010
15: 419832.910
16: 420012.167
17: 420821.800
18: 420962.178
19: 421981.676
20: 421721.257
21: 419996.594
22: 419742.884
23: 420158.066
24: 420156.773
25: 420325.231
26: 420966.914
27: 420787.147
28: 420296.789
29: 420520.843
BEST: 419366.157, AVG: 420595.157
Result(us): Vars
0: 343490.797
1: 342849.079
2: 341990.967
3: 342844.044
4: 343484.681
5: 342586.419
6: 342468.883
7: 343113.300
8: 343516.875
9: 343002.395
10: 341499.538
11: 342192.102
12: 341847.383
13: 342533.215
14: 341376.556
15: 342018.111
16: 341316.445
17: 342043.378
18: 341969.932
19: 343415.854
20: 343103.133
21: 342084.686
22: 341555.293
23: 342984.355
24: 342302.336
25: 341994.372
26: 342475.639
27: 342281.214
28: 342205.175
29: 342462.032
BEST: 341316.445, AVG: 342433.606
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19518
Can one of the admins verify this patch?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk Ah, thanks for the link back to that discussion. I'll make modifications to the trials for better data.
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
can we use `32767` as the array size upper bound? e.g.
```
class Foo {
  int[] globalVars1 = new int[32767];
  int[] globalVars2 = new int[32767];
  int[] globalVars3 = new int[32767];
  ...

  void apply0(InternalRow i) {
    globalVars1[0] = 1;
    globalVars1[1] = 1;
    ...
  }

  void apply1(InternalRow i) {
    globalVars2[0] = 1;
    globalVars2[1] = 1;
    ...
  }

  void apply(InternalRow i) {
    apply0(i);
    apply1(i);
    ...
  }
}
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19518
@viirya in general there is no benefit. There would be a benefit if we managed to declare them where they are used, but I am not sure this is feasible. That way, they would not add any entry to the constant pool.
For instance, if we have an inner class `InnerClass1` and we use `isNull_11111` only there, then defining `isNull_11111` as a variable of `InnerClass1` instead of a variable of the outer class leaves no entry for it in the outer class.
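The scoping idea can be sketched as follows (names hypothetical, modeled on the discussion above):

```java
// Sketch of the "declare where used" idea: a state variable referenced only
// by one split-out inner class is declared inside that inner class, so the
// outer class's constant pool carries no Fieldref/NameAndType entries for it.
class OuterClass {
    private final Inner1 inner1 = new Inner1();

    private class Inner1 {
        // Used only by apply1, so it lives here, not in OuterClass.
        private boolean isNull_11111 = true;

        boolean apply1() {
            isNull_11111 = false;
            return isNull_11111;
        }
    }

    boolean apply() {
        return inner1.apply1();
    }

    public static void main(String[] args) {
        System.out.println(new OuterClass().apply()); // false
    }
}
```

The hard part, as noted above, is proving during code generation that a variable really is used only inside one inner class.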
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
Next, I analyzed the usage of constant pool entries and java bytecode ops using `janinoc`. The summary is as follows:
```
array[4]     : 6 + 0 * n entries, 6-8 java bytecode ops / read access
outerInstance: 3 + 3 * n entries, 5 java bytecode ops / read access
innerInstance: 9 + 3 * n entries, 6 java bytecode ops / read access
```
```
public class CP {
  int[] a = new int[1000000];
  int globalVar0;
  int globalVar1;
  private Inner inner = new Inner();

  private class Inner {
    int nestedVar0;
    int nestedVar1;
  }

  void access() {
    a[4] = 0;
    a[5] = 0;
    globalVar0 = 0;
    globalVar1 = 0;
    inner.nestedVar0 = 0;
    inner.nestedVar1 = 0;
  }

  public static void main(String[] argv) {
    CP cp = new CP();
    cp.access();
  }
}
```
Java bytecode
```
void access();
descriptor: ()V
Code:
stack=3, locals=1, args_size=1
0: aload_0
1: getfield #12 // Field a:[I
4: iconst_4
5: iconst_0
6: iastore
7: aload_0
8: getfield #12 // Field a:[I
11: iconst_5
12: iconst_0
13: iastore
14: aload_0
15: iconst_0
16: putfield #16 // Field globalVar0:I
19: aload_0
20: iconst_0
21: putfield #19 // Field globalVar1:I
24: aload_0
25: getfield #23 // Field inner:LCP$Inner;
28: iconst_0
29: putfield #28 // Field CP$Inner.nestedVar0:I
32: aload_0
33: getfield #23 // Field inner:LCP$Inner;
36: iconst_0
37: putfield #31 // Field CP$Inner.nestedVar1:I
40: return
```
Constant pool
```
#1 = Utf8 CP
#2 = Class #1 // CP
#9 = Utf8 a
#10 = Utf8 [I
#11 = NameAndType #9:#10 // a:[I
#12 = Fieldref #2.#11 // CP.a:[I
#13 = Utf8 globalVar0
#14 = Utf8 I
#15 = NameAndType #13:#14 // globalVar0:I
#16 = Fieldref #2.#15 // CP.globalVar0:I
#17 = Utf8 globalVar1
#18 = NameAndType #17:#14 // globalVar1:I
#19 = Fieldref #2.#18 // CP.globalVar1:I
#20 = Utf8 inner
#21 = Utf8 LCP$Inner;
#22 = NameAndType #20:#21 // inner:LCP$Inner;
#23 = Fieldref #2.#22 // CP.inner:LCP$Inner;
#24 = Utf8 CP$Inner
#25 = Class #24 // CP$Inner
#26 = Utf8 nestedVar0
#27 = NameAndType #26:#14 // nestedVar0:I
#28 = Fieldref #25.#27 // CP$Inner.nestedVar0:I
#29 = Utf8 nestedVar1
#30 = NameAndType #29:#14 // nestedVar1:I
#31 = Fieldref #25.#30 // CP$Inner.nestedVar1:I
#32 = Utf8 LineNumberTable
#33 = Utf8 Code
#34 = Utf8 main
#35 = Utf8 ([Ljava/lang/String;)V
#36 = Utf8 <init>
#37 = NameAndType #36:#8 // "<init>":()V
#38 = Methodref #2.#37 // CP."<init>":()V
#39 = NameAndType #7:#8 // access:()V
#40 = Methodref #2.#39 // CP.access:()V
#41 = Methodref #4.#37 // java/lang/Object."<init>":()V
#42 = Integer 1000000
#43 = Utf8 (LCP;)V
#44 = NameAndType #36:#43 // "<init>":(LCP;)V
#45 = Methodref #25.#44 // CP$Inner."<init>":(LCP;)V
#46 = Utf8 Inner
#47 = Utf8 InnerClasses
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@cloud-fan Is it better to use this PR? Or, create a new PR?
---
[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19518#discussion_r150220400
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -801,12 +801,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
private[this] def castToByteCode(from: DataType, ctx: CodegenContext): CastFunction = from match {
case StringType =>
val wrapper = ctx.freshName("wrapper")
- ctx.addMutableState("UTF8String.IntWrapper", wrapper,
+ val wrapperAccessor = ctx.addMutableState("UTF8String.IntWrapper", wrapper,
--- End diff --
I'd like to have something like
```
val wrapper = ctx.addMutableState("UTF8String.IntWrapper", v => s"$v = new UTF8String.IntWrapper();")
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
I'd prefer the inner class approach.
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
For the strategy, I'd like to give priority to primitive values being flat global variables. We also need to decide the priority among primitive types, according to which type has the largest performance difference between flat global variable and array, and which type is used more frequently (maybe boolean).
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19518
@viirya if you take a look at the example I posted, you can see that we are not saving either `NameAndType` or `Fieldref` entries, thus I think the only solution to save constant pool entries we have found so far is to use arrays.
What may be interesting, IMHO, is to evaluate where we are using a variable. Since when we have a lot of instance variables we are very likely to also have several inner classes (for splitting the methods), I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I also think this is not at all trivial to achieve. @kiszk what do you think?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/19518
@bdrillard can you close this pr?
---
[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard closed the pull request at:
https://github.com/apache/spark/pull/19518
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@bdrillard I remember that we had a similar discussion about benchmarking. Could you take a look at [this discussion](https://github.com/apache/spark/pull/16648#discussion_r118043056)?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk thanks for your great analysis. May I ask just a couple of additional questions?
1 - In all your tests, which compiler are you using? I see that you are linking to the Oracle docs, so maybe you are using `javac` for your tests, but in my tests (made for other cases) I realized that `janinoc` works a bit differently, and what is true for `javac` may not be for `janinoc`.
2 - If the problem with arrays occurs when we go beyond 32767, what about creating many arrays with max size 32767? I see that this is not a definitive solution and we still have some limitations, but dividing the number of constant pool entries by 32767 looks like a very good achievement to me.
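The splitting idea amounts to mapping a flat state index onto (array, offset) pairs; a minimal sketch (hypothetical helper, not the Spark implementation):

```java
// Map a flat global-state index onto chunked arrays of at most 32767
// elements each, so every array access index fits in a sipush operand
// and needs no constant pool entry.
class ChunkedState {
    static final int MAX = 32767;

    // Returns the accessor code for the n-th piece of int state.
    static String accessor(int n) {
        int arrayId = n / MAX;
        int offset = n % MAX;
        return "globalVars" + (arrayId + 1) + "[" + offset + "]";
    }

    public static void main(String[] args) {
        System.out.println(accessor(0));      // globalVars1[0]
        System.out.println(accessor(32766));  // globalVars1[32766]
        System.out.println(accessor(32767));  // globalVars2[0]
        System.out.println(accessor(70000));  // globalVars3[4466]
    }
}
```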
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@cloud-fan I want to take this over if possible
cc @maropu
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
You are comparing arrays vs member variables; can we also compare arrays vs inner-class member variables? And since too many classes will add overhead on the classloader, we should test some extreme cases, like 1 million variables.
---
[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:
https://github.com/apache/spark/pull/19518#discussion_r145213781
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -801,7 +908,10 @@ class CodegenContext {
addNewFunction(name, code)
}
- foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
+ val exprs = transformFunctions(functions.map(name =>
+ s"$name(${arguments.map(_._2).mkString(", ")})"))
+
+ splitExpressions(exprs, funcName, arguments)
--- End diff --
Changes I made here to `splitExpressions` were to handle instances where the split code method references were still over 64kb. It would seem this problem is addressed by @mgaido91 in #19480, and that implementation is much more thorough, so if that PR gets merged, I'd prefer to rebase against that.
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19518
Can one of the admins verify this patch?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
ping @cloud-fan
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@cloud-fan you are right, I am updating the benchmark program and results.
I realized one issue with the array approach: **a limitation on constant pool entries due to the array access index**.
When we use the array approach, a global variable will be accessed as `this.globalVar[12345]`. Here is the bytecode sequence. Each access to an array element (when the index is greater than 5, since `iconst_0` ... `iconst_5` do not use a constant pool entry) requires one constant pool entry.
So while we save one constant pool entry for the global variable, we require one constant pool entry for the index.
@bdrillard did your implementation (probably around [here](https://github.com/apache/spark/pull/19518/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR218)) avoid this issue?
WDYT? cc @viirya @maropu
```
aload 0 // load this
getfield [constant pool index] // load this.globalVar
ldc [constant pool index] // load 12345 from constant pool and push it
iaload
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
@bdrillard thanks!
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
> When we use an inner class approach, we still require a constant pool entry for accessing instance variables (e.g. `this.inner001.globalVar55555`) in one class.
@kiszk But can we still save the name/type and field name entries for the global variables?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@bdrillard @cloud-fan @maropu
I created and ran a benchmark program. I think that using an array for compaction is slower than using scalar instance variables; in my case below, about 20% slower in the best time.
Thus, I would like to use an approach that creates inner classes to keep scalar instance variables.
WDYT? Any comments are very appreciated.
Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && javac Test.java && java Test
Result(us): Array
0: 145333.227
1: 144288.262
2: 144233.871
3: 144536.350
4: 144503.269
5: 144836.117
6: 144448.053
7: 144744.725
8: 144688.652
9: 144727.823
10: 144447.789
11: 144500.638
12: 144641.592
13: 144464.106
14: 144518.914
15: 144844.639
16: 144780.464
17: 144617.363
18: 144463.271
19: 144508.170
20: 144929.451
21: 144529.697
22: 144273.167
23: 144362.926
24: 144296.854
25: 144398.665
26: 144490.813
27: 144435.732
28: 144675.997
29: 144483.581
BEST: 144233.871000, AVG: 144566.806
Result(us): Vars
0: 120375.384
1: 119800.238
2: 119822.842
3: 119830.761
4: 119836.781
5: 120185.751
6: 120208.140
7: 120274.925
8: 120112.109
9: 120082.120
10: 120063.456
11: 120112.493
12: 120144.937
13: 119964.356
14: 119941.633
15: 119825.758
16: 119677.506
17: 119833.236
18: 119749.781
19: 119723.932
20: 120197.394
21: 120052.820
22: 120006.650
23: 119939.335
24: 119857.469
25: 120176.229
26: 120153.605
27: 120345.581
28: 120163.129
29: 120038.673
BEST: 119677.506, AVG: 120016.567
```
A small MyInstance.java (N = 16, M = 4):
```
class MyInstance {
  final int N = 16;
  int[] instance = new int[N];

  void accessArrays00000() {
    instance[8] = instance[0];
    instance[9] = instance[1];
    instance[10] = instance[2];
    instance[11] = instance[3];
  }

  void accessArrays00001() {
    instance[12] = instance[4];
    instance[13] = instance[5];
    instance[14] = instance[6];
    instance[15] = instance[7];
  }

  void accessArrays00002() {
    instance[0] = instance[8];
    instance[1] = instance[9];
    instance[2] = instance[10];
    instance[3] = instance[11];
  }

  void accessArrays00003() {
    instance[4] = instance[12];
    instance[5] = instance[13];
    instance[6] = instance[14];
    instance[7] = instance[15];
  }

  void accessArray() {
    accessArrays00000();
    accessArrays00001();
    accessArrays00002();
    accessArrays00003();
  }

  int instance00000;
  int instance00001;
  int instance00002;
  int instance00003;
  int instance00004;
  int instance00005;
  int instance00006;
  int instance00007;
  int instance00008;
  int instance00009;
  int instance00010;
  int instance00011;
  int instance00012;
  int instance00013;
  int instance00014;
  int instance00015;

  void accessVars00000() {
    instance00008 = instance00000;
    instance00009 = instance00001;
    instance00010 = instance00002;
    instance00011 = instance00003;
  }

  void accessVars00001() {
    instance00012 = instance00004;
    instance00013 = instance00005;
    instance00014 = instance00006;
    instance00015 = instance00007;
  }

  void accessVars00002() {
    instance00000 = instance00008;
    instance00001 = instance00009;
    instance00002 = instance00010;
    instance00003 = instance00011;
  }

  void accessVars00003() {
    instance00004 = instance00012;
    instance00005 = instance00013;
    instance00006 = instance00014;
    instance00007 = instance00015;
  }

  void accessVars() {
    accessVars00000();
    accessVars00001();
    accessVars00002();
    accessVars00003();
  }
}
```
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
let's create a new PR
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/19518
This PR was addressed by #19811, closing this one.
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/19518
Thanks for giving this the attention needed to shepherd it through. I haven't had the time in the last few weeks to do the additional coding work necessary to benchmark it properly. @kiszk, if there are any questions regarding my earlier implementation as you make/review the second PR, I'm happy to make clarifications and can respond to those in writing quickly.
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
The hybrid approach sounds reasonable to me. Any special strategy to use to decide which fields are global variables and which are in array?
---
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk You are correct that the current implementation compacts all mutable state (where the state does not have to be _explicitly_ inlined).
To your last question, I'd attempted some analysis of the JVM bytecode of array versus inlined state initialized either through method calls or in loops. I'd posted the experiment and results: https://github.com/bdrillard/bytecode-poc
If Spark has its own benchmarking tools, I'd be happy to use those to compare Catalyst-generated classes further.
To the general question of _when_ we compact state, I think some kind of threshold still makes sense. It would be best to ensure that the typical code path (for typical Dataset schemas) remains unaffected by the changes (as was the aim when generating nested classes in #18075).
I've found that setting a global threshold for when to compact mutable state can be hard. Some state _has_ to be inlined (state that uses parameterized constructors and can't easily be initialized in loops, like the `BufferHolder` and `UnsafeRowWriter`). I've seen situations where, due to code generator flow, we began by inlining an amount of state that _could have been_ compacted, then started compacting state after a set threshold, but then began inlining state again that _could not be_ compacted, forcing us over the constant pool limit.
It's difficult to tell when a certain piece of state will be referenced frequently or infrequently. For example, we do know some pieces of primitive mutable state, like global booleans that are part of conditional checks, are initialized globally, assigned once in one method, and then referenced only once in a separate caller method. These are excellent candidates for compaction, since they proliferate very quickly and are, in a sense, "only used once" (declared, initialized, re-assigned in a method, accessed in another method, never used again).
Other pieces of state, like row objects, and JavaBean objects, will be accessed a number of times relative to how many fields they have, which isn't necessarily easy info to retrieve during code generation (we'd have to reflect or do inspection of the initialization code to know how many fields such an object has). But these items are probably still good candidates for compaction in general because of how many of a given type there could be.
I'm inclined to use a threshold against the name/types of the state, rather than a global threshold. Since `freshName` is always monotonically increasing from 1 for a given variable prefix, we could know when a threshold for state of that type was reached, and when we could begin compacting that type of state, independently/concurrently with the other types of state. Such a scheme would allow us to ensure the usual flow of code-generation remains as it is now, with no state-compaction for typical operations, and then with state-compaction in the more extreme cases that would threaten to blow the Constant Pool limit.
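The per-prefix scheme described above can be sketched in a few lines. This is a hand-written illustration only, not the actual Catalyst `addMutableState` API: the class name, the threshold value, and the naming convention are all hypothetical. The idea is to keep a counter per variable-name prefix, inline the first few declarations of each prefix, and compact the rest into a per-prefix array.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: decide, per variable-name prefix, whether to inline a mutable
// state declaration or compact it into an array slot. The threshold and
// naming here are purely illustrative.
public class StateCompactor {
    static final int INLINE_THRESHOLD = 3; // per-prefix inline budget (illustrative)
    private final Map<String, Integer> countsByPrefix = new HashMap<>();

    // Returns the code reference to use for this piece of state: either a
    // fresh inlined field name or an array-access expression.
    String declare(String prefix) {
        int n = countsByPrefix.merge(prefix, 1, Integer::sum);
        if (n <= INLINE_THRESHOLD) {
            return prefix + n;                   // e.g. "isNull1" -> inlined field
        }
        int slot = n - INLINE_THRESHOLD - 1;     // 0-based slot in the compacted array
        return prefix + "Array[" + slot + "]";   // compacted into an array element
    }

    public static void main(String[] args) {
        StateCompactor c = new StateCompactor();
        for (int i = 0; i < 5; i++) {
            System.out.println(c.declare("isNull"));
        }
        // Prints: isNull1, isNull2, isNull3, isNullArray[0], isNullArray[1]
    }
}
```

Because each prefix has its own counter, the compaction decision for, say, boolean null-flags is independent of the decision for row objects, which matches the monotonic behavior of `freshName` noted above.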
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk I meant that `janinoc` creates a slightly different constant pool from `javac`. I am not sure about performance, but the number of constant pool entries is definitely different. For instance, let's take this example and analyze the `Outer` class constant pool:
```
class Outer {
    private Inner innerInstance = new Inner();

    private void outerMethod() {
        innerInstance.b = 1;
        innerInstance.a = 1;
    }

    private boolean outerMethod2() {
        return innerInstance.b > innerInstance.a;
    }

    private class Inner {
        int b = 2;
        private int a = 0;
    }
}
```
if you compile it with `javac`, the constant pool will be:
```
Constant pool:
#1 = Methodref #9.#25 // java/lang/Object."<init>":()V
#2 = Class #26 // Outer$Inner
#3 = Methodref #2.#27 // Outer$Inner."<init>":(LOuter;LOuter$1;)V
#4 = Fieldref #8.#28 // Outer.innerInstance:LOuter$Inner;
#5 = Fieldref #2.#29 // Outer$Inner.b:I
#6 = Methodref #2.#30 // Outer$Inner.access$102:(LOuter$Inner;I)I
#7 = Methodref #2.#31 // Outer$Inner.access$100:(LOuter$Inner;)I
#8 = Class #32 // Outer
#9 = Class #33 // java/lang/Object
#10 = Class #34 // Outer$1
#11 = Utf8 InnerClasses
#12 = Utf8 Inner
#13 = Utf8 innerInstance
#14 = Utf8 LOuter$Inner;
#15 = Utf8 <init>
#16 = Utf8 ()V
#17 = Utf8 Code
#18 = Utf8 LineNumberTable
#19 = Utf8 outerMethod
#20 = Utf8 outerMethod2
#21 = Utf8 ()Z
#22 = Utf8 StackMapTable
#23 = Utf8 SourceFile
#24 = Utf8 Outer.java
#25 = NameAndType #15:#16 // "<init>":()V
#26 = Utf8 Outer$Inner
#27 = NameAndType #15:#35 // "<init>":(LOuter;LOuter$1;)V
#28 = NameAndType #13:#14 // innerInstance:LOuter$Inner;
#29 = NameAndType #36:#37 // b:I
#30 = NameAndType #38:#39 // access$102:(LOuter$Inner;I)I
#31 = NameAndType #40:#41 // access$100:(LOuter$Inner;)I
#32 = Utf8 Outer
#33 = Utf8 java/lang/Object
#34 = Utf8 Outer$1
#35 = Utf8 (LOuter;LOuter$1;)V
#36 = Utf8 b
#37 = Utf8 I
#38 = Utf8 access$102
#39 = Utf8 (LOuter$Inner;I)I
#40 = Utf8 access$100
#41 = Utf8 (LOuter$Inner;)I
```
(please note that it creates fake getter and setter method entries for the `private` inner variable `a`).
If you compile the same class with `janinoc`, instead, the constant pool will be:
```
Constant pool:
#1 = Utf8 Outer
#2 = Class #1 // Outer
#3 = Utf8 java/lang/Object
#4 = Class #3 // java/lang/Object
#5 = Utf8 SourceFile
#6 = Utf8 Outer.java
#7 = Utf8 outerMethod$
#8 = Utf8 (LOuter;)V
#9 = Utf8 innerInstance
#10 = Utf8 LOuter$Inner;
#11 = NameAndType #9:#10 // innerInstance:LOuter$Inner;
#12 = Fieldref #2.#11 // Outer.innerInstance:LOuter$Inner;
#13 = Utf8 Outer$Inner
#14 = Class #13 // Outer$Inner
#15 = Utf8 b
#16 = Utf8 I
#17 = NameAndType #15:#16 // b:I
#18 = Fieldref #14.#17 // Outer$Inner.b:I
#19 = Utf8 a
#20 = NameAndType #19:#16 // a:I
#21 = Fieldref #14.#20 // Outer$Inner.a:I
#22 = Utf8 LineNumberTable
#23 = Utf8 Code
#24 = Utf8 outerMethod2$
#25 = Utf8 (LOuter;)Z
#26 = Utf8 <init>
#27 = Utf8 ()V
#28 = NameAndType #26:#27 // "<init>":()V
#29 = Methodref #4.#28 // java/lang/Object."<init>":()V
#30 = NameAndType #26:#8 // "<init>":(LOuter;)V
#31 = Methodref #14.#30 // Outer$Inner."<init>":(LOuter;)V
#32 = Utf8 Inner
#33 = Utf8 InnerClasses
```
(note that `a` now is considered as a regular field).
Thus in all our tests we should use `janinoc` instead of `javac` to evaluate the behavior of the constant pool in the different cases.
Let me know if you have any question or doubt. Thanks.
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
The PR looks good.
Although it can relieve the constant pool pressure, I have a question. Does it mean that every time a constant in the short range is used in the code, we add one byte to the bytecode, because `sipush` is followed by a two-byte value? I'm afraid this may be a negative for cases with many short constants in a Java program.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
In the `int` case, the following are the Java bytecode lengths for loading a value:
1 byte: -1, 0, 1, 2, 3, 4, 5 (iconst_?)
2 byte: -128 ~ -2, 6 ~ 127 (bipush)
3 bytes: -32768 ~ -129, 128 ~ 32767 (sipush)
4 or 5 bytes: others (ldc or ldc_w)
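The ranges above can be captured in a small helper, purely to illustrate the table (the class and method names are hypothetical). Note that as pointed out later in this thread, `ldc` is 2 bytes and `ldc_w` is 3, with each additionally consuming a constant pool entry.

```java
public class ConstLoadSize {
    // Returns the number of bytecode bytes needed to load an int constant,
    // following the ranges javac uses:
    //   iconst_m1..iconst_5 : 1 byte
    //   bipush              : 2 bytes
    //   sipush              : 3 bytes
    //   ldc (ldc_w: 3)      : 2 bytes plus a constant pool entry
    static int loadInsnBytes(int v) {
        if (v >= -1 && v <= 5) return 1;          // iconst_<n>
        if (v >= -128 && v <= 127) return 2;      // bipush
        if (v >= -32768 && v <= 32767) return 3;  // sipush
        return 2;                                  // ldc, plus a constant pool entry
    }

    public static void main(String[] args) {
        System.out.println(loadInsnBytes(5));      // 1
        System.out.println(loadInsnBytes(127));    // 2
        System.out.println(loadInsnBytes(32767));  // 3
    }
}
```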
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
Btw, can we configure the maximum number of global variables?
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
Here is [a PR](https://github.com/janino-compiler/janino/pull/34) to fix a problem regarding `sipush`.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays, using janinoc to compile the target Java file.
The performance is not much different from the previous run. In summary, the following are the performance results, relative to the array case (**a smaller number is better**).
- 1: array
- 0.90: inner global variables
- 0.81: flat global variables
WDYT? Any comments are very appreciated.
Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
```
$ cat /proc/cpuinfo | grep "model name" | uniq
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python myInstance.py > MyInstance.java && janinoc MyInstance.java && javac Test.java && java -Xmx16g Test
Result(us): Array
0: 484251.446
1: 483374.255
2: 483956.692
3: 482498.241
4: 483602.261
5: 482654.567
6: 482896.671
7: 483458.625
8: 483194.317
9: 483387.234
10: 484103.729
11: 483536.493
12: 483790.828
13: 483590.991
14: 483993.488
15: 483455.164
16: 484040.009
17: 483225.837
18: 483126.520
19: 484105.989
20: 484988.935
21: 483766.245
22: 483667.930
23: 483271.499
24: 483071.606
25: 483174.438
26: 483602.474
27: 483210.405
28: 483907.061
29: 483071.964
BEST: 482498.241000, AVG: 483532.530
Result(us): InnerVars
0: 437016.533
1: 436125.481
2: 436360.534
3: 435857.758
4: 436166.243
5: 437089.913
6: 436168.359
7: 435570.397
8: 435550.848
9: 435256.088
10: 435252.679
11: 435765.156
12: 435646.739
13: 437303.993
14: 435315.530
15: 435752.545
16: 434857.606
17: 436776.190
18: 435444.877
19: 435657.649
20: 436248.147
21: 436322.998
22: 437214.262
23: 435907.223
24: 435431.025
25: 435274.317
26: 435412.202
27: 435670.321
28: 436494.045
29: 436347.838
BEST: 434857.606, AVG: 435975.250
Result(us): Vars
0: 353983.048
1: 354067.690
2: 353138.178
3: 354093.115
4: 354067.180
5: 352750.571
6: 353672.510
7: 355179.115
8: 353296.750
9: 354522.113
10: 355221.301
11: 355178.172
12: 353859.319
13: 353539.817
14: 352703.352
15: 353923.981
16: 354442.744
17: 355523.145
18: 354849.122
19: 354082.888
20: 354673.504
21: 355526.218
22: 355264.029
23: 355455.492
24: 355520.322
25: 353923.520
26: 353796.600
27: 355021.849
28: 355800.387
29: 353810.567
BEST: 352703.352, AVG: 354362.887
```
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
> Each access to an array element requires one constant pool entry.
What do you mean by this? Does using an array not reduce the constant pool size at all?
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@mgaido91 Thank you for your questions.
1. I am using `javac` as shown. I am sorry, but I cannot understand what you are pointing out. In this benchmark, what are the differences between `javac` and `janinoc`?
2. I agree with you.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
The good news is that [the issue](https://github.com/janino-compiler/janino/issues/33) in janino has been quickly fixed. The bad news is that there is no official release date for the next version yet.
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/19518
I like the latest hybrid idea from @kiszk in terms of performance and readability. Also, this is a corner case, so I don't want to affect most regular small queries.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
First of all, I have to share sad news with you. `janino` does not use `sipush` for values from 128 to 32767. The current `janino` 3.0.7 uses `iconst...`, `bipush`, or `ldc`, whereas `javac` uses `sipush` for values from 128 to 32767. In other words, if an index is greater than 127, one constant pool entry is used per constant in bytecode compiled by `janino`. It should be fixed.
```
public class Array {
    int[] a = new int[1000000];

    void access() {
        a[5] = 0;
        a[6] = 0;
        a[127] = 0;
        a[128] = 0;
        a[1023] = 0;
        a[16383] = 0;
        a[32767] = 0;
        a[32768] = 0;
    }

    public static void main(String[] argv) {
        Array a = new Array();
        a.access();
    }
}
```
```
void access();
descriptor: ()V
flags:
Code:
stack=3, locals=1, args_size=1
0: aload_0
1: getfield #12 // Field a:[I
4: iconst_5
5: iconst_0
6: iastore
7: aload_0
8: getfield #12 // Field a:[I
11: bipush 6
13: iconst_0
14: iastore
15: aload_0
16: getfield #12 // Field a:[I
19: bipush 127
21: iconst_0
22: iastore
23: aload_0
24: getfield #12 // Field a:[I
27: ldc #13 // int 128
29: iconst_0
30: iastore
31: aload_0
32: getfield #12 // Field a:[I
35: ldc #14 // int 1023
37: iconst_0
38: iastore
39: aload_0
40: getfield #12 // Field a:[I
43: ldc #15 // int 16383
45: iconst_0
46: iastore
47: aload_0
48: getfield #12 // Field a:[I
51: ldc #16 // int 32767
53: iconst_0
54: iastore
55: aload_0
56: getfield #12 // Field a:[I
59: ldc #17 // int 32768
61: iconst_0
62: iastore
63: return
Constant pool:
#1 = Utf8 Array
#2 = Class #1 // Array
#9 = Utf8 a
#10 = Utf8 [I
#11 = NameAndType #9:#10 // a:[I
#12 = Fieldref #2.#11 // Array.a:[I
#13 = Integer 128
#14 = Integer 1023
#15 = Integer 16383
#16 = Integer 32767
#17 = Integer 32768
```
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
@mgaido91 Thank you very much. I did not know about that difference. I will validate with `janinoc` 3.0.0.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
In #19811, I will first take a hybrid approach of outer (flat) global variables and arrays. The threshold for the number of global variables would be configurable.
I will give high priority to primitive variables, placing them in the outer class for performance.
> I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I think also that this is not trivial to achieve at all.
It would be great if we could do this. For now, #19811 will not address this.
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk @maropu do any of you want to take this over? This patch has become important now that we split code more aggressively.
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19518
@bdrillard since my PR and others got merged, there are now some conflicts; could you please fix them? Thanks.
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
@mgaido91 Thanks. I looked at the constant pool you posted. It's clear.
Is there any benefit to declaring the variables in the inner classes? It looks like they still occupy constant pool entries.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
> I think ldc is 2 bytes and ldc_w is 3 bytes?
You are right, thanks, updated.
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19518
ping @bdrillard
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
OK, I will create a new PR
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19518
> 4 or 5 bytes: others (ldc or ldc_w)
I think ldc is 2 bytes and ldc_w is 3 bytes?
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/19518
yea, ok @kiszk I'll review your work.
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
Thank you for creating a PR for the latest Spark.
I think that it is great to reduce # of constant pool entries. I have one high level comment.
IIUC, this PR **always** performs mutable state compaction. In other words, all mutable states are placed in arrays.
I am afraid of possible performance degradation due to the increased access cost of keeping states in arrays.
What do you think about putting mutable states into arrays (i.e. performing mutable state compaction) only when there are many mutable states, or only for certain mutable states that are rarely accessed?
Or can we say there is no performance degradation due to mutable state compaction?
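To make the access-cost concern concrete, here is the shape of the two patterns being compared. This is a hand-written illustration, not actual Catalyst output: a flat field access compiles to a single `getfield`/`putfield`, while a compacted state access adds an index load plus an array element load/store on every use.

```java
public class AccessShapes {
    // Flat global variable: one getfield/putfield per access.
    private boolean isNull1;
    // Compacted state: getfield on the array, then an indexed element access.
    private final boolean[] mutableStateArray = new boolean[16];

    boolean flat() {
        isNull1 = true;           // putfield
        return isNull1;           // getfield
    }

    boolean compacted() {
        mutableStateArray[1] = true;   // getfield + index load + bastore
        return mutableStateArray[1];   // getfield + index load + baload
    }

    public static void main(String[] args) {
        AccessShapes a = new AccessShapes();
        // Both patterns behave identically; they differ only in bytecode shape.
        System.out.println(a.flat() && a.compacted());
    }
}
```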
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19518
Based on the performance results and constant pool entry usage, I would like to use a hybrid approach with flat global variables and arrays.
For example, the first 500 variables are stored as flat global variables, and the rest are stored in arrays of 32767 elements. I think most non-extreme cases can then enjoy simple code without array accesses, and good performance.
WDYT?
```
class Foo {
    int globalVars1;
    int globalVars2;
    ...
    int globalVars499;
    int globalVars500;

    int[] globalArrays1 = new int[32767];
    int[] globalArrays2 = new int[32767];
    int[] globalArrays3 = new int[32767];
    ...

    void apply1(InternalRow i) {
        globalVars1 = 1;
        globalVars2 = 1;
        ...
        globalVars499 = 1;
        globalVars500 = 1;
    }

    void apply2(InternalRow i) {
        globalArrays1[0] = 1;
        globalArrays1[1] = 1;
        ...
    }

    void apply(InternalRow i) {
        apply0(i);
        apply1(i);
        apply2(i);
        ...
    }
}
```
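The hybrid split above could be driven by a simple allocator, sketched below. The 500 flat-variable limit and the 32767 array size come from the comment above; everything else (class name, naming scheme) is hypothetical illustration.

```java
// Sketch of the hybrid scheme: the first FLAT_LIMIT states become flat
// global variables; everything after that is assigned a slot in one of
// several 32767-element arrays.
public class HybridAllocator {
    static final int FLAT_LIMIT = 500;   // flat-variable budget from the proposal
    static final int ARRAY_SIZE = 32767; // largest index reachable with sipush
    private int count = 0;

    // Returns the code reference for the next piece of mutable state.
    String allocate() {
        int n = count++;
        if (n < FLAT_LIMIT) {
            return "globalVars" + (n + 1);          // e.g. "globalVars1"
        }
        int overflow = n - FLAT_LIMIT;
        int arrayId = overflow / ARRAY_SIZE + 1;    // which array holds this state
        int slot = overflow % ARRAY_SIZE;           // slot within that array
        return "globalArrays" + arrayId + "[" + slot + "]";
    }

    public static void main(String[] args) {
        HybridAllocator a = new HybridAllocator();
        System.out.println(a.allocate());           // globalVars1
        for (int i = 0; i < 499; i++) a.allocate(); // consume the flat budget
        System.out.println(a.allocate());           // globalArrays1[0]
    }
}
```

Keeping every array index below 32768 matters because, as discussed earlier in the thread, indices up to 32767 can be loaded with `sipush` without consuming a constant pool entry.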