Posted to reviews@spark.apache.org by bdrillard <gi...@git.apache.org> on 2017/10/17 18:20:08 UTC

[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

GitHub user bdrillard opened a pull request:

    https://github.com/apache/spark/pull/19518

    [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State Compaction

    ## What changes were proposed in this pull request?
    
    This PR is the second follow-up to #18075, meant to address [SPARK-18016](https://issues.apache.org/jira/browse/SPARK-18016), Constant Pool limit exceptions. Part 1 implemented `NestedClass` code splitting, in which excess code was split off into nested private sub-classes of the `OuterClass`. In Part 2 we address excess mutable state, where the number of inlined variables declared at the top of the `OuterClass` can also exceed the constant pool limit.
    
    Here, we modify the `addMutableState` function in the `CodeGenerator` to check whether the declared state can be compacted into an array and initialized in loops, rather than inlined and initialized with its own line of code. We identify four types of state that can be compacted:
    
    * Primitive state (ints, booleans, etc.)
    * Object state of like-type without any initial assignment
    * Object state of like-type initialized to `null`
    * Object state of like-type initialized to the type's base (no-argument) constructor
    
    With mutable state compaction, at the top of the class we generate array declarations like:
    
    ```
    private Object[] references;
    private UnsafeRow result;
    private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
      ...
    private boolean[] mutableStateArray1 = new boolean[12507];
    private InternalRow[] mutableStateArray4 = new InternalRow[5268];
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] mutableStateArray5 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[7663];
    private java.lang.String[] mutableStateArray2 = new java.lang.String[12477];
    private int[] mutableStateArray = new int[42509];
    private java.lang.Object[] mutableStateArray6 = new java.lang.Object[30];
    private boolean[] mutableStateArray3 = new boolean[10536];
    ```
    
    and these arrays are initialized in loops as:
    
    ```
    private void init_3485() {
        for (int i = 0; i < mutableStateArray3.length; i++) {
            mutableStateArray3[i] = false;
        }
    }
    ```
    
    For compacted mutable state, `addMutableState` returns an array accessor value, which is then referenced in the subsequent generated code.
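
    For illustration, here is a minimal, self-contained sketch of the before/after shape (the class names and the slot index are hypothetical; only the array name and size follow the pattern above):
    
    ```
    class BeforeCompaction {
      // one field per piece of state: each adds Utf8, NameAndType, and Fieldref entries
      private boolean isNull_17;
    
      void consume(boolean resultIsNull) {
        isNull_17 = resultIsNull;
      }
    }
    
    class AfterCompaction {
      // a single array field covers many logical state slots
      private boolean[] mutableStateArray3 = new boolean[10536];
    
      void consume(boolean resultIsNull) {
        // addMutableState would hand back the accessor "mutableStateArray3[17]",
        // and the generated code uses it wherever the field name was used before
        mutableStateArray3[17] = resultIsNull;
      }
    }
    ```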
    
    **Note**: some state cannot be easily compacted (except perhaps with deeper changes to code generation), as some state value names are taken for granted at the global level during code generation (see `CatalystToExternalMap` in `Objects` as an example). For this state, we provide an `inline` hint to the function call, which indicates that the state should be inlined in the `OuterClass`. Still, the state we can easily compact reduces the Constant Pool to a tractable size for the wide/deeply nested schemas I was able to test against.
    
    ## How was this patch tested?
    
    Tested against several complex schema types; also added a test case generating 40,000 string columns and creating the `UnsafeProjection`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdrillard/spark state_compaction

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19518
    
----
commit 081bc5de6ee55e00ff58c4abddc347f77c29d4aa
Author: ALeksander Eskilson <al...@cerner.com>
Date:   2017-10-17T14:06:12Z

    adding state compaction

commit e7046c3d3bb528f18b3183d81e8bc26720a8baf7
Author: ALeksander Eskilson <al...@cerner.com>
Date:   2017-10-17T16:54:54Z

    adding inline changes

----


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > what do you mean by this? using array can't reduce constant pool size at all?
    
    Not at all. However, when the array index for an access is 32768 or greater, the access requires a constant pool entry. This is because an integer constant of 32768 or greater uses the `ldc` java bytecode instruction [[ref]](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ldc)[[ref]](https://cs.au.dk/~mis/dOvs/jvmspec/ref-_ldc.html).
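
    As a quick, self-contained illustration (assuming `javac`; the class name and indices are arbitrary):
    
    ```
    class IndexDemo {
      int[] a = new int[40000];
    
      void access() {
        a[32767] = 0;  // sipush 32767 -- no constant pool entry needed
        a[32768] = 0;  // ldc of an Integer constant pool entry holding 32768
      }
    }
    ```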


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays. In summary, the following are the performance results (**smaller is better**).
    - 0.65: flat global variables
    - 0.91: inner global variables
    - 1: array
    
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && javac Test.java && java Test
    
    Result(us): Array
       0: 462848.969
       1: 461978.693
       2: 463174.459
       3: 461422.763
       4: 460563.915
       5: 460112.262
       6: 460059.957
       7: 460376.230
       8: 460245.445
       9: 460308.775
      10: 460154.955
      11: 460005.629
      12: 460330.584
      13: 460277.612
      14: 460181.360
      15: 460168.843
      16: 459790.137
      17: 460248.481
      18: 460344.471
      19: 460084.529
      20: 459987.263
      21: 459961.639
      22: 459952.447
      23: 460128.518
      24: 460025.783
      25: 459874.303
      26: 459932.685
      27: 460065.736
      28: 459954.526
      29: 459972.679
    BEST: 459790.137000, AVG: 460417.788
    
    Result(us): InnerVars
       0: 421013.480
       1: 420279.235
       2: 419366.157
       3: 421015.934
       4: 419540.049
       5: 420316.650
       6: 419816.612
       7: 420211.140
       8: 420215.864
       9: 421104.657
      10: 421836.430
      11: 420866.894
      12: 421457.850
      13: 421734.506
      14: 420796.010
      15: 419832.910
      16: 420012.167
      17: 420821.800
      18: 420962.178
      19: 421981.676
      20: 421721.257
      21: 419996.594
      22: 419742.884
      23: 420158.066
      24: 420156.773
      25: 420325.231
      26: 420966.914
      27: 420787.147
      28: 420296.789
      29: 420520.843
    BEST: 419366.157, AVG: 420595.157
    
    Result(us): Vars
       0: 343490.797
       1: 342849.079
       2: 341990.967
       3: 342844.044
       4: 343484.681
       5: 342586.419
       6: 342468.883
       7: 343113.300
       8: 343516.875
       9: 343002.395
      10: 341499.538
      11: 342192.102
      12: 341847.383
      13: 342533.215
      14: 341376.556
      15: 342018.111
      16: 341316.445
      17: 342043.378
      18: 341969.932
      19: 343415.854
      20: 343103.133
      21: 342084.686
      22: 341555.293
      23: 342984.355
      24: 342302.336
      25: 341994.372
      26: 342475.639
      27: 342281.214
      28: 342205.175
      29: 342462.032
    BEST: 341316.445, AVG: 342433.606
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk Ah, thanks for the link back to that discussion. I'll make modifications to the trials for better data.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    can we use `32767` as the array size upper bound? e.g.
    ```
    class Foo {
      int[] globalVars1 = new int[32767];
      int[] globalVars2 = new int[32767];
      int[] globalVars3 = new int[32767];
      ...
    
      void apply0(InternalRow i) {
        globalVars1[0] = 1;
        globalVars1[1] = 1;
        ...
      }
      void apply1(InternalRow i) {
        globalVars2[0] = 1;
        globalVars2[1] = 1;
        ...
      }
    
      void apply(InternalRow i) {
        apply0(i);
        apply1(i);
        ...
      }
    }
    ```


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @viirya in general there is no benefit. There would be a benefit if we managed to declare them where they are used, but I am not sure this is feasible. In that way, they would not add any entry to the constant pool of the outer class.
    
    For instance, if we have an inner class `InnerClass1` and we use `isNull_11111` only there, then if we define `isNull_11111` as a variable of `InnerClass1` instead of a variable of the outer class, we have no entry for it in the outer class.
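
    A minimal sketch of that idea (the class and method names other than `InnerClass1` and `isNull_11111` are hypothetical): as long as the field is declared and referenced only inside `InnerClass1`, the outer class's constant pool needs no `NameAndType`/`Fieldref` entries for it.
    
    ```
    class OuterClass {
      private InnerClass1 inner1 = new InnerClass1();
    
      void apply() {
        inner1.doWork();  // the outer class only references the inner instance itself
      }
    
      class InnerClass1 {
        private boolean isNull_11111;  // entries for this field live only in InnerClass1
    
        void doWork() {
          isNull_11111 = true;
        }
      }
    }
    ```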



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Next, I analyzed usage of constant pool entries and java bytecode ops using `janinoc`. The summary is as follows:
    ```
    array[4]     : 6 + 0 * n entries, 6-8 java bytecode ops / read access
    outerInstance: 3 + 3 * n entries, 5 java bytecode ops / read access
    innerInstance: 9 + 3 * n entries, 6 java bytecode ops / read access
    ```
    
    ```
    public class CP {
      int[] a = new int[1000000];
      int globalVar0;
      int globalVar1;
      private Inner inner = new Inner();
    
      private class Inner {
        int nestedVar0;
        int nestedVar1;
      }
    
      void access() {
        a[4] = 0;
        a[5] = 0;
    
        globalVar0 = 0;
        globalVar1 = 0;
    
        inner.nestedVar0 = 0;
        inner.nestedVar1 = 0;
      }
    
      static public void main(String[] argv) {
        CP cp = new CP();
        cp.access();
      }
    }
    ```
    
    Java bytecode
    ```
      void access();
        descriptor: ()V
        Code:
          stack=3, locals=1, args_size=1
             0: aload_0
             1: getfield      #12                 // Field a:[I
             4: iconst_4
             5: iconst_0
             6: iastore
             7: aload_0
             8: getfield      #12                 // Field a:[I
            11: iconst_5
            12: iconst_0
            13: iastore
    
            14: aload_0
            15: iconst_0
            16: putfield      #16                 // Field globalVar0:I
            19: aload_0
            20: iconst_0
            21: putfield      #19                 // Field globalVar1:I
    
            24: aload_0
            25: getfield      #23                 // Field inner:LCP$Inner;
            28: iconst_0
            29: putfield      #28                 // Field CP$Inner.nestedVar0:I
            32: aload_0
            33: getfield      #23                 // Field inner:LCP$Inner;
            36: iconst_0
            37: putfield      #31                 // Field CP$Inner.nestedVar1:I
            40: return
    ```
    
    Constant pool
    ```
       #1 = Utf8               CP
       #2 = Class              #1             // CP
       #9 = Utf8               a
      #10 = Utf8               [I
      #11 = NameAndType        #9:#10         // a:[I
      #12 = Fieldref           #2.#11         // CP.a:[I
    
      #13 = Utf8               globalVar0
      #14 = Utf8               I
      #15 = NameAndType        #13:#14        // globalVar0:I
      #16 = Fieldref           #2.#15         // CP.globalVar0:I
    
      #17 = Utf8               globalVar1
      #18 = NameAndType        #17:#14        // globalVar1:I
      #19 = Fieldref           #2.#18         // CP.globalVar1:I
    
      #20 = Utf8               inner
      #21 = Utf8               LCP$Inner;
      #22 = NameAndType        #20:#21        // inner:LCP$Inner;
      #23 = Fieldref           #2.#22         // CP.inner:LCP$Inner;
    
      #24 = Utf8               CP$Inner
      #25 = Class              #24            // CP$Inner
      #26 = Utf8               nestedVar0
      #27 = NameAndType        #26:#14        // nestedVar0:I
      #28 = Fieldref           #25.#27        // CP$Inner.nestedVar0:I
      
      #29 = Utf8               nestedVar1
      #30 = NameAndType        #29:#14        // nestedVar1:I
      #31 = Fieldref           #25.#30        // CP$Inner.nestedVar1:I
    
      #32 = Utf8               LineNumberTable
      #33 = Utf8               Code
      #34 = Utf8               main
      #35 = Utf8               ([Ljava/lang/String;)V
      #36 = Utf8               <init>
      #37 = NameAndType        #36:#8         // "<init>":()V
      #38 = Methodref          #2.#37         // CP."<init>":()V
      #39 = NameAndType        #7:#8          // access:()V
      #40 = Methodref          #2.#39         // CP.access:()V
      #41 = Methodref          #4.#37         // java/lang/Object."<init>":()V
      #42 = Integer            1000000
      #43 = Utf8               (LCP;)V
      #44 = NameAndType        #36:#43        // "<init>":(LCP;)V
      #45 = Methodref          #25.#44        // CP$Inner."<init>":(LCP;)V
      #46 = Utf8               Inner
      #47 = Utf8               InnerClasses
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan Is it better to use this PR? Or, create a new PR?


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19518#discussion_r150220400
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
    @@ -801,12 +801,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
       private[this] def castToByteCode(from: DataType, ctx: CodegenContext): CastFunction = from match {
         case StringType =>
           val wrapper = ctx.freshName("wrapper")
    -      ctx.addMutableState("UTF8String.IntWrapper", wrapper,
    +      val wrapperAccessor = ctx.addMutableState("UTF8String.IntWrapper", wrapper,
    --- End diff --
    
    I'd like to have something like
    ```
    val wrapper = ctx.addMutableState("UTF8String.IntWrapper", v => s"$v = new UTF8String.IntWrapper();")
    ```


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I'd prefer inner class approach.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    For the strategy, I'd like to give priority to primitive values being flat global variables. We also need to decide the priority among primitive types, according to which type has the largest performance difference between a flat global variable and an array, and which type is used more frequently (maybe boolean).


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @viirya if you take a look at the example I posted, you can see that we are not saving either `NameAndType` or `Fieldref` entries, thus I think the only solution we have found so far to save constant pool entries is to use arrays.
    
    What may be interesting, IMHO, is to evaluate where we are using a variable. Since when we have a lot of instance variables we are very likely to also have several inner classes (for splitting the methods), it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I also think this is not trivial to achieve. @kiszk what do you think?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard can you close this pr?


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard closed the pull request at:

    https://github.com/apache/spark/pull/19518


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard I remember that we had a similar discussion about benchmarking. Could you take a look at [this discussion](https://github.com/apache/spark/pull/16648#discussion_r118043056)?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk thanks for your great analysis. May I ask just a couple of additional questions?
     1 - In all your tests, which compiler are you using? I see that you are linking to the Oracle docs, so maybe you are using `javac` for your tests, but in my tests (made for other cases) I realized that `janinoc` works a bit differently, and what is true for `javac` may not be true for `janinoc`.
     2 - If the problem with arrays occurs when we go beyond 32767, what about creating many arrays with max size 32767 (see the sketch below)? I see that this is not a definitive solution and we still have some limitations, but dividing the number of constant pool entries by 32767 looks like a very good achievement to me.
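
    A rough, self-contained sketch of what that could look like (the array names and the chosen slots are hypothetical): each array is capped at 32767 elements, so every literal index fits in a `sipush` and needs no `ldc` constant pool entry.
    
    ```
    class SplitArrays {
      private boolean[] mutableStateArray_0 = new boolean[32767];  // logical slots 0..32766
      private boolean[] mutableStateArray_1 = new boolean[32767];  // logical slots 32767..65533
    
      void consume() {
        mutableStateArray_0[32766] = true;  // sipush 32766
        mutableStateArray_1[0] = true;      // logical slot 32767 -> array 1, offset 0
      }
    }
    ```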


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan I want to take this over if possible
    cc @maropu


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    You are comparing arrays vs member variables; can we compare arrays vs inner class member variables? Also, too many classes will have overhead on the classloader, so we should test some extreme cases, like 1 million variables.


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19518#discussion_r145213781
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,7 +908,10 @@ class CodegenContext {
             addNewFunction(name, code)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val exprs = transformFunctions(functions.map(name =>
    +        s"$name(${arguments.map(_._2).mkString(", ")})"))
    +
    +      splitExpressions(exprs, funcName, arguments)
    --- End diff --
    
    Changes I made here to `splitExpressions` were to handle instances where the split code method references were still over 64kb. It would seem this problem is addressed by @mgaido91 in #19480, and that implementation is much more thorough, so if that PR gets merged, I'd prefer to rebase against that.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    ping @cloud-fan


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan you are right, I am updating the benchmark program and results.
    I realized one issue with the array approach: **constant pool entries required by array access indices**
    
    When we use the array approach, a global variable will be accessed as `this.globalVar[12345]`. Here is the bytecode sequence below. Each access to an array element (with an index greater than 5, since `iconst_0` ... `iconst_5` do not use a constant pool entry) requires one constant pool entry.
    So while we save one constant pool entry for the global variable, we require one constant pool entry for the access.
    @bdrillard did your implementation (probably around [here](https://github.com/apache/spark/pull/19518/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR218)) avoid this issue?
    
    WDYT? cc @viirya @maropu 
    
    ```
    aload 0  // load this
    getfield [constant pool index] // load this.globalVar
    ldc [constant pool index] // load 12345 from constant pool and push it
    iaload
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard thanks!


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > When we use an inner class approach, we still require constant pool entries for accessing instance variables (e.g. `this.inner001.globalVar55555`) in one class.
    
    @kiszk But we can still save the `NameAndType` and field name entries for the global variable?



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard @cloud-fan @maropu
    I created and ran a benchmark program. I think that using an array for compaction is slower than using scalar instance variables. In my case below, it is 20% slower in the best time.
    
    Thus, I would like to use an approach that creates inner classes to keep scalar instance variables.
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && javac Test.java && java Test
    
    
    Result(us): Array
       0: 145333.227
       1: 144288.262
       2: 144233.871
       3: 144536.350
       4: 144503.269
       5: 144836.117
       6: 144448.053
       7: 144744.725
       8: 144688.652
       9: 144727.823
      10: 144447.789
      11: 144500.638
      12: 144641.592
      13: 144464.106
      14: 144518.914
      15: 144844.639
      16: 144780.464
      17: 144617.363
      18: 144463.271
      19: 144508.170
      20: 144929.451
      21: 144529.697
      22: 144273.167
      23: 144362.926
      24: 144296.854
      25: 144398.665
      26: 144490.813
      27: 144435.732
      28: 144675.997
      29: 144483.581
    BEST: 144233.871000, AVG: 144566.806
    
    Result(us): Vars
       0: 120375.384
       1: 119800.238
       2: 119822.842
       3: 119830.761
       4: 119836.781
       5: 120185.751
       6: 120208.140
       7: 120274.925
       8: 120112.109
       9: 120082.120
      10: 120063.456
      11: 120112.493
      12: 120144.937
      13: 119964.356
      14: 119941.633
      15: 119825.758
      16: 119677.506
      17: 119833.236
      18: 119749.781
      19: 119723.932
      20: 120197.394
      21: 120052.820
      22: 120006.650
      23: 119939.335
      24: 119857.469
      25: 120176.229
      26: 120153.605
      27: 120345.581
      28: 120163.129
      29: 120038.673
    BEST: 119677.506, AVG: 120016.567
    ```
    
    Small MyInstance.java (N = 16, M = 4)
    ```
    class MyInstance {
      final int N = 16;
      int[] instance = new int[N];
      void accessArrays00000() {
        instance[8] = instance[0];
        instance[9] = instance[1];
        instance[10] = instance[2];
        instance[11] = instance[3];
      }
      void accessArrays00001() {
        instance[12] = instance[4];
        instance[13] = instance[5];
        instance[14] = instance[6];
        instance[15] = instance[7];
      }
      void accessArrays00002() {
        instance[0] = instance[8];
        instance[1] = instance[9];
        instance[2] = instance[10];
        instance[3] = instance[11];
      }
      void accessArrays00003() {
        instance[4] = instance[12];
        instance[5] = instance[13];
        instance[6] = instance[14];
        instance[7] = instance[15];
      }
      void accessArray() {
        accessArrays00000();
        accessArrays00001();
        accessArrays00002();
        accessArrays00003();
      }
    
      int instance00000;
      int instance00001;
      int instance00002;
      int instance00003;
      int instance00004;
      int instance00005;
      int instance00006;
      int instance00007;
      int instance00008;
      int instance00009;
      int instance00010;
      int instance00011;
      int instance00012;
      int instance00013;
      int instance00014;
      int instance00015;
      void accessVars00000() {
        instance00008 = instance00000;
        instance00009 = instance00001;
        instance00010 = instance00002;
        instance00011 = instance00003;
      }
      void accessVars00001() {
        instance00012 = instance00004;
        instance00013 = instance00005;
        instance00014 = instance00006;
        instance00015 = instance00007;
      }
      void accessVars00002() {
        instance00000 = instance00008;
        instance00001 = instance00009;
        instance00002 = instance00010;
        instance00003 = instance00011;
      }
      void accessVars00003() {
        instance00004 = instance00012;
        instance00005 = instance00013;
        instance00006 = instance00014;
        instance00007 = instance00015;
      }
      void accessVars() {
        accessVars00000();
        accessVars00001();
        accessVars00002();
        accessVars00003();
      }
    }
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    let's create a new PR


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    This PR was addressed by #19811, closing this one.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Thanks for giving this the attention to shepherd it through. I haven't had the time to do the additional coding work necessary to properly benchmark it in the last few weeks. @kiszk, if there are any questions regarding my earlier implementation as you make/review the second PR, I'm happy to make clarifications and can respond to those in writing quickly.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    The hybrid approach sounds reasonable to me. Is there any special strategy to decide which fields are global variables and which go into the array?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk You are correct that the current implementation compacts all mutable state (where the state does not have to be _explicitly_ inlined).
    
    To your last question, I'd attempted some analysis of the JVM bytecode of array versus inlined state initialized either through method calls or in loops. I'd posted the experiment and results: https://github.com/bdrillard/bytecode-poc
    
    If Spark has its own benchmarking tools, I'd be happy to use those to compare Catalyst-generated classes further.
    
    To the general question of _when_ we compact state, I think some kind of threshold still makes sense. It would be best to ensure that the typical code path (for typical Dataset schemas) remains unaffected by the changes (as was the aim when generating nested classes in #18075).
    
    I've found that trying to set a global threshold for when to compact mutable state can be hard. Some state _has_ to be inlined (state that uses parameterized constructors that can't be easily initialized with loops, like the `BufferHolder` and `UnsafeRowWriter`). I've found situations where, due to code generator flow, we began by inlining an amount of state that _could have been_ compacted, then started compacting state after a set threshold, but then began inlining state again that _could not be_ compacted, forcing us over the constant pool limit.
    
    It's difficult to tell when a certain piece of state will be referenced frequently or infrequently. For example, we do know some pieces of primitive mutable state, like global booleans that are part of conditional checks, are initialized globally, assigned once in one method, and then referenced only once in a separate caller method. These are excellent candidates for compaction, since they proliferate very quickly and are, in a sense, "only used once" (declared, initialized, re-assigned in a method, accessed in another method, never used again). 
    
    Other pieces of state, like row objects, and JavaBean objects, will be accessed a number of times relative to how many fields they have, which isn't necessarily easy info to retrieve during code generation (we'd have to reflect or do inspection of the initialization code to know how many fields such an object has). But these items are probably still good candidates for compaction in general because of how many of a given type there could be. 
    
    I'm inclined to use a threshold against the name/types of the state, rather than a global threshold. Since `freshName` is always monotonically increasing from 1 for a given variable prefix, we could know when a threshold for state of that type was reached, and when we could begin compacting that type of state, independently/concurrently with the other types of state. Such a scheme would allow us to ensure the usual flow of code-generation remains as it is now, with no state-compaction for typical operations, and then with state-compaction in the more extreme cases that would threaten to blow the Constant Pool limit.
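
    A rough sketch of the per-prefix bookkeeping I have in mind (all names, the threshold, and the method signature are hypothetical, not the actual `CodeGenerator` API):
    
    ```
    import java.util.HashMap;
    import java.util.Map;
    
    class MutableStateBookkeeping {
      // below this many entries for a given prefix, state stays as flat inlined fields
      private static final int COMPACT_THRESHOLD = 100;
      private final Map<String, Integer> countsByPrefix = new HashMap<>();
    
      // Returns the code used to reference the state: a flat field name while the
      // prefix is below the threshold, an array slot once it has too many entries.
      String addMutableState(String javaType, String prefix) {
        int count = countsByPrefix.merge(prefix, 1, Integer::sum);
        if (count <= COMPACT_THRESHOLD) {
          return prefix + "_" + (count - 1);                                // e.g. "isNull_42"
        } else {
          return prefix + "Array[" + (count - COMPACT_THRESHOLD - 1) + "]"; // e.g. "isNullArray[3]"
        }
      }
    }
    ```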


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk I meant that `janinoc` creates a slightly different constant pool from `javac`. I am not sure about performance, but the number of constant pool entries is definitely different. For instance, let's take this example and analyze the `Outer` class constant pool:
    ```
    class Outer {
      private Inner innerInstance = new Inner();
      private void outerMethod(){
        innerInstance.b = 1;
        innerInstance.a = 1;
      }
      private boolean outerMethod2(){
      	return innerInstance.b > innerInstance.a;
      }
      private class Inner {
        int b =2;
        private int a = 0;
      }
    }
    ```
    if you compile it with `javac`, the constant pool will be:
    ```
    Constant pool:
       #1 = Methodref          #9.#25         // java/lang/Object."<init>":()V
       #2 = Class              #26            // Outer$Inner
       #3 = Methodref          #2.#27         // Outer$Inner."<init>":(LOuter;LOuter$1;)V
       #4 = Fieldref           #8.#28         // Outer.innerInstance:LOuter$Inner;
       #5 = Fieldref           #2.#29         // Outer$Inner.b:I
       #6 = Methodref          #2.#30         // Outer$Inner.access$102:(LOuter$Inner;I)I
       #7 = Methodref          #2.#31         // Outer$Inner.access$100:(LOuter$Inner;)I
       #8 = Class              #32            // Outer
       #9 = Class              #33            // java/lang/Object
      #10 = Class              #34            // Outer$1
      #11 = Utf8               InnerClasses
      #12 = Utf8               Inner
      #13 = Utf8               innerInstance
      #14 = Utf8               LOuter$Inner;
      #15 = Utf8               <init>
      #16 = Utf8               ()V
      #17 = Utf8               Code
      #18 = Utf8               LineNumberTable
      #19 = Utf8               outerMethod
      #20 = Utf8               outerMethod2
      #21 = Utf8               ()Z
      #22 = Utf8               StackMapTable
      #23 = Utf8               SourceFile
      #24 = Utf8               Outer.java
      #25 = NameAndType        #15:#16        // "<init>":()V
      #26 = Utf8               Outer$Inner
      #27 = NameAndType        #15:#35        // "<init>":(LOuter;LOuter$1;)V
      #28 = NameAndType        #13:#14        // innerInstance:LOuter$Inner;
      #29 = NameAndType        #36:#37        // b:I
      #30 = NameAndType        #38:#39        // access$102:(LOuter$Inner;I)I
      #31 = NameAndType        #40:#41        // access$100:(LOuter$Inner;)I
      #32 = Utf8               Outer
      #33 = Utf8               java/lang/Object
      #34 = Utf8               Outer$1
      #35 = Utf8               (LOuter;LOuter$1;)V
      #36 = Utf8               b
      #37 = Utf8               I
      #38 = Utf8               access$102
      #39 = Utf8               (LOuter$Inner;I)I
      #40 = Utf8               access$100
      #41 = Utf8               (LOuter$Inner;)I
    ```
    (please note that it creates a fake getter and a fake setter method entries for the `private` inner variable `a`). 
    
    If you compile the same class with `janinoc`, instead, the constant pool will be:
    ```
    Constant pool:
       #1 = Utf8               Outer
       #2 = Class              #1             // Outer
       #3 = Utf8               java/lang/Object
       #4 = Class              #3             // java/lang/Object
       #5 = Utf8               SourceFile
       #6 = Utf8               Outer.java
       #7 = Utf8               outerMethod$
       #8 = Utf8               (LOuter;)V
       #9 = Utf8               innerInstance
      #10 = Utf8               LOuter$Inner;
      #11 = NameAndType        #9:#10         // innerInstance:LOuter$Inner;
      #12 = Fieldref           #2.#11         // Outer.innerInstance:LOuter$Inner;
      #13 = Utf8               Outer$Inner
      #14 = Class              #13            // Outer$Inner
      #15 = Utf8               b
      #16 = Utf8               I
      #17 = NameAndType        #15:#16        // b:I
      #18 = Fieldref           #14.#17        // Outer$Inner.b:I
      #19 = Utf8               a
      #20 = NameAndType        #19:#16        // a:I
      #21 = Fieldref           #14.#20        // Outer$Inner.a:I
      #22 = Utf8               LineNumberTable
      #23 = Utf8               Code
      #24 = Utf8               outerMethod2$
      #25 = Utf8               (LOuter;)Z
      #26 = Utf8               <init>
      #27 = Utf8               ()V
      #28 = NameAndType        #26:#27        // "<init>":()V
      #29 = Methodref          #4.#28         // java/lang/Object."<init>":()V
      #30 = NameAndType        #26:#8         // "<init>":(LOuter;)V
      #31 = Methodref          #14.#30        // Outer$Inner."<init>":(LOuter;)V
      #32 = Utf8               Inner
      #33 = Utf8               InnerClasses
    ```
    (note that `a` now is considered as a regular field).
    
    Thus in all our tests we should use `janinoc` instead of `javac` to evaluate the behavior of the constant pool in the different cases.
    
    Let me know if you have any question or doubt. Thanks.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    The PR looks good.
    
    Although it can relieve the constant pool pressure, I have a question: does it mean that every time a constant in the short range is used in the code, we add one byte to the bytecode, because `sipush` is followed by a two-byte value? I'm afraid it may be negative for some cases, such as a Java program with many short constants.



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    In the int case, the following are the java bytecode lengths for loading a value:
    
    1 byte: -1, 0, 1, 2, 3, 4, 5 (iconst_?)
    2 bytes: -128 ~ -2, 6 ~ 127 (bipush)
    3 bytes: -32768 ~ -129, 128 ~ 32767 (sipush)
    4 or 5 bytes: others (ldc or ldc_w)



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Btw, can we config the maximum number of global variables?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Here is [a PR](https://github.com/janino-compiler/janino/pull/34) to fix a problem regarding `sipush`.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays, this time using janinoc for the target Java file.
    The performance is not much different from the previous run. In summary, the following are the performance results (**smaller is better**).
    - 1: array
    - 0.90: inner global variables
    - 0.81: flat global variables
    
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && janinoc MyInstance.java  && javac Test.java && java -Xmx16g Test
    
    Result(us): Array
       0: 484251.446
       1: 483374.255
       2: 483956.692
       3: 482498.241
       4: 483602.261
       5: 482654.567
       6: 482896.671
       7: 483458.625
       8: 483194.317
       9: 483387.234
      10: 484103.729
      11: 483536.493
      12: 483790.828
      13: 483590.991
      14: 483993.488
      15: 483455.164
      16: 484040.009
      17: 483225.837
      18: 483126.520
      19: 484105.989
      20: 484988.935
      21: 483766.245
      22: 483667.930
      23: 483271.499
      24: 483071.606
      25: 483174.438
      26: 483602.474
      27: 483210.405
      28: 483907.061
      29: 483071.964
    BEST: 482498.241000, AVG: 483532.530
    
    Result(us): InnerVars
       0: 437016.533
       1: 436125.481
       2: 436360.534
       3: 435857.758
       4: 436166.243
       5: 437089.913
       6: 436168.359
       7: 435570.397
       8: 435550.848
       9: 435256.088
      10: 435252.679
      11: 435765.156
      12: 435646.739
      13: 437303.993
      14: 435315.530
      15: 435752.545
      16: 434857.606
      17: 436776.190
      18: 435444.877
      19: 435657.649
      20: 436248.147
      21: 436322.998
      22: 437214.262
      23: 435907.223
      24: 435431.025
      25: 435274.317
      26: 435412.202
      27: 435670.321
      28: 436494.045
      29: 436347.838
    BEST: 434857.606, AVG: 435975.250
    
    Result(us): Vars
       0: 353983.048
       1: 354067.690
       2: 353138.178
       3: 354093.115
       4: 354067.180
       5: 352750.571
       6: 353672.510
       7: 355179.115
       8: 353296.750
       9: 354522.113
      10: 355221.301
      11: 355178.172
      12: 353859.319
      13: 353539.817
      14: 352703.352
      15: 353923.981
      16: 354442.744
      17: 355523.145
      18: 354849.122
      19: 354082.888
      20: 354673.504
      21: 355526.218
      22: 355264.029
      23: 355455.492
      24: 355520.322
      25: 353923.520
      26: 353796.600
      27: 355021.849
      28: 355800.387
      29: 353810.567
    BEST: 352703.352, AVG: 354362.887
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > Each access to an array element requires one constant pool entry.
    
    what do you mean by this? using array can't reduce constant pool size at all?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thank you for your questions.
    1. I am using `javac` as shown. I am sorry that I cannot understand what you are pointing out. In this benchmark, what are the differences between `javac` and `janinoc`?
    2. I agree with you.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Good news: [the issue](https://github.com/janino-compiler/janino/issues/33) in janino has been quickly fixed. Bad news: there is no official date for releasing the next version yet.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I like the latest hybrid idea from @kiszk in terms of performance and readability. Also, this is a corner case, so I don't want to affect most regular small queries.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    First of all, I have to share sad news with you. `janino` does not use `sipush` for values from 128 to 32767. The current `janino` 3.0.7 uses `iconst...`, `bipush`, or `ldc`. `javac` uses `sipush` for values from 128 to 32767. In other words, if an index is greater than 127, one constant pool entry is used by bytecode compiled with `janino`. It should be fixed.
    
    ```
    public class Array {
      int[] a = new int[1000000];
    
      void access() {
        a[5] = 0;
        a[6] = 0;
        a[127] = 0;
        a[128] = 0;
        a[1023] = 0;
        a[16383] = 0;
        a[32767] = 0;
        a[32768] = 0;
      }
    
      static public void main(String[] argv) {
        Array a = new Array();
        a.access();
      }
    }
    ```
    
    ```
      void access();
        descriptor: ()V
        flags:
        Code:
          stack=3, locals=1, args_size=1
             0: aload_0
             1: getfield      #12                 // Field a:[I
             4: iconst_5
             5: iconst_0
             6: iastore
             7: aload_0
             8: getfield      #12                 // Field a:[I
            11: bipush        6
            13: iconst_0
            14: iastore
            15: aload_0
            16: getfield      #12                 // Field a:[I
            19: bipush        127
            21: iconst_0
            22: iastore
            23: aload_0
            24: getfield      #12                 // Field a:[I
            27: ldc           #13                 // int 128
            29: iconst_0
            30: iastore
            31: aload_0
            32: getfield      #12                 // Field a:[I
            35: ldc           #14                 // int 1023
            37: iconst_0
            38: iastore
            39: aload_0
            40: getfield      #12                 // Field a:[I
            43: ldc           #15                 // int 16383
            45: iconst_0
            46: iastore
            47: aload_0
            48: getfield      #12                 // Field a:[I
            51: ldc           #16                 // int 32767
            53: iconst_0
            54: iastore
            55: aload_0
            56: getfield      #12                 // Field a:[I
            59: ldc           #17                 // int 32768
            61: iconst_0
            62: iastore
            63: return
    
    Constant pool:
       #1 = Utf8               Array
       #2 = Class              #1             // Array
       #9 = Utf8               a
      #10 = Utf8               [I
      #11 = NameAndType        #9:#10         // a:[I
      #12 = Fieldref           #2.#11         // Array.a:[I
    
      #13 = Integer            128
      #14 = Integer            1023
      #15 = Integer            16383
      #16 = Integer            32767
      #17 = Integer            32768
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thank you very much. I did not know about that difference. I will validate with `janinoc` 3.0.0.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    In #19811, as a first step, I will take a hybrid approach of outer (flat) global variables and arrays. The threshold for the number of global variables would be configurable.
    
    I will give high priority to primitive variables, placing them in the outer class for performance.
    
    > I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I think also that this is not trivial to achieve at all.
    
    It would be great if we could do this. For now, #19811 will not address it.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
     @kiszk @maropu any of you wanna take this over? This patch becomes important as we now split code more aggressively.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard since my PR and others got merged, there are now some conflicts; could you please fix them? Thanks.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thanks. I looked at the constant pool you posted. It's clear.
    
    Is there any benefit to declaring the variables in the inner classes? It looks like they still occupy constant pool entries?
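    
    For what it's worth, here is a minimal, hypothetical sketch of the point behind the question (class and field names are illustrative, not from this PR): every class file carries its own constant pool, so a field that is declared and used only inside a nested class consumes entries in the nested class's pool, while the outer class only pays for the entries needed to reference the nested class itself.
    
    ```
    // Hypothetical sketch; names are illustrative, not generated by this PR.
    public class OuterClass {
      // Creating the nested instance needs Class/Methodref entries in OuterClass's pool.
      private final NestedState state = new NestedState();
    
      public void apply() {
        state.compute();  // one Methodref entry in OuterClass's pool
      }
    
      static class NestedState {
        int counter;  // the field's name/descriptor entries live in NestedState's pool
    
        void compute() {
          counter += 1;  // the Fieldref here is resolved against NestedState's own pool
        }
      }
    }
    ```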


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > I think ldc is 2 bytes and ldc_w is 3 bytes?
    You are right, thanks, updated.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    ping @bdrillard 


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    OK, I will create a new PR


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    >  4 or 5 bytes: others (ldc or ldc_w)
    
    I think ldc is 2 bytes and ldc_w is 3 bytes?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    yea, ok @kiszk I'll review your work.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Thank you for creating a PR for the latest Spark.
    
    I think that it is great to reduce the number of constant pool entries. I have one high-level comment.  
    IIUC, this PR **always** performs mutable state compaction. In other words, mutable states are always stored in arrays.  
    I am afraid of possible performance degradation due to the increased access cost of putting states in arrays.
    
    What do you think about putting mutable states into arrays (i.e. performing mutable state compaction) only when there are many mutable states, or only for certain mutable states that are rarely accessed?  
    Or, can we say there is no performance degradation due to mutable state compaction?
    
    What do you think?
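    
    To make the access-cost concern concrete, below is a minimal sketch (hypothetical names, not code generated by this PR) of the two shapes of generated code being compared. The flat form writes a field with a single `putfield`, while the compacted form loads the array field, pushes an index, and stores into the slot, with an array bounds check at runtime:
    
    ```
    // Hypothetical sketch of flat vs. compacted mutable state access.
    public class GeneratedIterator {
      // Flat (inlined) state: one field per entry, one constant pool Fieldref each.
      private int value_1;
    
      // Compacted state: many entries share a single array field.
      private int[] mutableStateArray = new int[4];
    
      public void applyFlat() {
        value_1 = 1;               // aload_0; iconst_1; putfield
      }
    
      public void applyCompacted() {
        mutableStateArray[3] = 1;  // aload_0; getfield; iconst_3; iconst_1; iastore (+ bounds check)
      }
    }
    ```
    
    Whether those extra loads and bounds checks are visible in end-to-end benchmarks is exactly the open question raised here.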



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Based on the performance results and constant pool entry usage, I would like to use a hybrid approach with flat global variables and arrays.  
    For example, the first 500 variables are stored as flat global variables, and the rest are stored in arrays of 32767 elements, as sketched below. I think that most non-extreme cases can then enjoy simple code without array accesses, and good performance.
    
    WDYT?
    
    ```
    class Foo {
      int globalVars1;
      int globalVars2;
      ...
      int globalVars499;
      int globalVars500;
      int[] globalArrays1 = new int[32767];
      int[] globalArrays2 = new int[32767];
      int[] globalArrays3 = new int[32767];
      ...
    
      void apply1(InternalRow i) {
        globalVars1 = 1;
        globalVars2 = 1;
        ...
        globalVars499 = 1;
        globalVars500 = 1;
      }
    
      void apply2(InternalRow i) {
        globalArrays1[0] = 1;
        globalArrays1[1] = 1;
        ...
      }
    
      void apply(InternalRow i) {
        apply0(i);
        apply1(i);
        apply2(i);
        ...
      }
    }
    ```


---
