Posted to reviews@spark.apache.org by bdrillard <gi...@git.apache.org> on 2017/10/17 18:20:08 UTC

[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

GitHub user bdrillard opened a pull request:

    https://github.com/apache/spark/pull/19518

    [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State Compaction

    ## What changes were proposed in this pull request?
    
    This PR is the second follow-up to #18075, meant to address [SPARK-18016](https://issues.apache.org/jira/browse/SPARK-18016), Constant Pool limit exceptions. Part 1 implemented `NestedClass` code splitting, in which excess code was split off into nested private sub-classes of the `OuterClass`. In Part 2 we address excess mutable state, where the number of inlined variables declared at the top of the `OuterClass` can also exceed the constant pool limit.
    
    Here, we modify the `addMutableState` function in the `CodeGenerator` to check whether the declared state can be compacted into an array and initialized in loops, rather than inlined and initialized with its own line of code. We identify four types of state that can be compacted:
    
    * Primitive state (ints, booleans, etc.)
    * Object state of like-type without any initial assignment
    * Object state of like-type initialized to `null`
    * Object state of like-type initialized to the type's base (no-argument) constructor
    
    With mutable state compaction, at the top of the class we generate array declarations like:
    
    ```
    private Object[] references;
    private UnsafeRow result;
    private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
      ...
    private boolean[] mutableStateArray1 = new boolean[12507];
    private InternalRow[] mutableStateArray4 = new InternalRow[5268];
    private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] mutableStateArray5 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[7663];
    private java.lang.String[] mutableStateArray2 = new java.lang.String[12477];
    private int[] mutableStateArray = new int[42509];
    private java.lang.Object[] mutableStateArray6 = new java.lang.Object[30];
    private boolean[] mutableStateArray3 = new boolean[10536];
    ```
    
    and these arrays are initialized in loops as:
    
    ```
    private void init_3485() {
        for (int i = 0; i < mutableStateArray3.length; i++) {
            mutableStateArray3[i] = false;
        }
    }
    ```
    
    For compacted mutable state, `addMutableState` returns an array accessor value, which is then referenced in the subsequent generated code.
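
    For illustration, here is a minimal, self-contained sketch of the before/after shape (the class names and the slot index are hypothetical; only the array name and size follow the pattern above):
    
    ```
    class BeforeCompaction {
      // one field per piece of state: each adds Utf8, NameAndType, and Fieldref entries
      private boolean isNull_17;
    
      void consume(boolean resultIsNull) {
        isNull_17 = resultIsNull;
      }
    }
    
    class AfterCompaction {
      // a single array field covers many logical state slots
      private boolean[] mutableStateArray3 = new boolean[10536];
    
      void consume(boolean resultIsNull) {
        // addMutableState would hand back the accessor "mutableStateArray3[17]",
        // and the generated code uses it wherever the field name was used before
        mutableStateArray3[17] = resultIsNull;
      }
    }
    ```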
    
    **Note**: some state cannot be easily compacted (except perhaps with deeper changes to code generation), as some state value names are taken for granted at the global level during code generation (see `CatalystToExternalMap` in `Objects` as an example). For this state, we provide an `inline` hint to the function call, which indicates that the state should be inlined in the `OuterClass`. Still, the state we can easily compact reduces the Constant Pool to a tractable size for the wide/deeply nested schemas I was able to test against.
    
    ## How was this patch tested?
    
    Tested against several complex schema types; also added a test case generating 40,000 string columns and creating the `UnsafeProjection`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdrillard/spark state_compaction

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19518
    
----
commit 081bc5de6ee55e00ff58c4abddc347f77c29d4aa
Author: ALeksander Eskilson <al...@cerner.com>
Date:   2017-10-17T14:06:12Z

    adding state compaction

commit e7046c3d3bb528f18b3183d81e8bc26720a8baf7
Author: ALeksander Eskilson <al...@cerner.com>
Date:   2017-10-17T16:54:54Z

    adding inline changes

----


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > what do you mean by this? using array can't reduce constant pool size at all?
    
    Not at all. However, when the array index for an access is 32768 or greater, the access requires a constant pool entry. This is because an integer constant of 32768 or greater uses the `ldc` java bytecode instruction [[ref]](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.ldc)[[ref]](https://cs.au.dk/~mis/dOvs/jvmspec/ref-_ldc.html).
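
    As a quick, self-contained illustration (assuming `javac`; the class name and indices are arbitrary):
    
    ```
    class IndexDemo {
      int[] a = new int[40000];
    
      void access() {
        a[32767] = 0;  // sipush 32767 -- no constant pool entry needed
        a[32768] = 0;  // ldc of an Integer constant pool entry holding 32768
      }
    }
    ```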


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays. In summary, the following are the performance results (**smaller is better**).
    - 0.65: flat global variables
    - 0.91: inner global variables
    - 1: array
    
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && javac Test.java && java Test
    
    Result(us): Array
       0: 462848.969
       1: 461978.693
       2: 463174.459
       3: 461422.763
       4: 460563.915
       5: 460112.262
       6: 460059.957
       7: 460376.230
       8: 460245.445
       9: 460308.775
      10: 460154.955
      11: 460005.629
      12: 460330.584
      13: 460277.612
      14: 460181.360
      15: 460168.843
      16: 459790.137
      17: 460248.481
      18: 460344.471
      19: 460084.529
      20: 459987.263
      21: 459961.639
      22: 459952.447
      23: 460128.518
      24: 460025.783
      25: 459874.303
      26: 459932.685
      27: 460065.736
      28: 459954.526
      29: 459972.679
    BEST: 459790.137000, AVG: 460417.788
    
    Result(us): InnerVars
       0: 421013.480
       1: 420279.235
       2: 419366.157
       3: 421015.934
       4: 419540.049
       5: 420316.650
       6: 419816.612
       7: 420211.140
       8: 420215.864
       9: 421104.657
      10: 421836.430
      11: 420866.894
      12: 421457.850
      13: 421734.506
      14: 420796.010
      15: 419832.910
      16: 420012.167
      17: 420821.800
      18: 420962.178
      19: 421981.676
      20: 421721.257
      21: 419996.594
      22: 419742.884
      23: 420158.066
      24: 420156.773
      25: 420325.231
      26: 420966.914
      27: 420787.147
      28: 420296.789
      29: 420520.843
    BEST: 419366.157, AVG: 420595.157
    
    Result(us): Vars
       0: 343490.797
       1: 342849.079
       2: 341990.967
       3: 342844.044
       4: 343484.681
       5: 342586.419
       6: 342468.883
       7: 343113.300
       8: 343516.875
       9: 343002.395
      10: 341499.538
      11: 342192.102
      12: 341847.383
      13: 342533.215
      14: 341376.556
      15: 342018.111
      16: 341316.445
      17: 342043.378
      18: 341969.932
      19: 343415.854
      20: 343103.133
      21: 342084.686
      22: 341555.293
      23: 342984.355
      24: 342302.336
      25: 341994.372
      26: 342475.639
      27: 342281.214
      28: 342205.175
      29: 342462.032
    BEST: 341316.445, AVG: 342433.606
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk Ah, thanks for the link back to that discussion. I'll make modifications to the trials for better data.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    can we use `32767` as the array size upper bound? e.g.
    ```
    class Foo {
      int[] globalVars1 = new int[32767];
      int[] globalVars2 = new int[32767];
      int[] globalVars3 = new int[32767];
      ...
    
      void apply0(InternalRow i) {
        globalVars1[0] = 1;
        globalVars1[1] = 1;
        ...
      }
      void apply1(InternalRow i) {
        globalVars2[0] = 1;
        globalVars2[1] = 1;
        ...
      }
    
      void apply(InternalRow i) {
        apply0(i);
        apply1(i);
        ...
      }
    }
    ```


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @viirya in general there is no benefit. There would be a benefit if we managed to declare them where they are used, but I am not sure this is feasible. In that way, they would not add any entry to the constant pool of the outer class.
    
    For instance, if we have an inner class `InnerClass1` and we use `isNull_11111` only there, then if we define `isNull_11111` as a variable of `InnerClass1` instead of a variable of the outer class, we have no entry for it in the outer class.
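
    A minimal sketch of that idea (the class and method names other than `InnerClass1` and `isNull_11111` are hypothetical): as long as the field is declared and referenced only inside `InnerClass1`, the outer class's constant pool needs no `NameAndType`/`Fieldref` entries for it.
    
    ```
    class OuterClass {
      private InnerClass1 inner1 = new InnerClass1();
    
      void apply() {
        inner1.doWork();  // the outer class only references the inner instance itself
      }
    
      class InnerClass1 {
        private boolean isNull_11111;  // entries for this field live only in InnerClass1
    
        void doWork() {
          isNull_11111 = true;
        }
      }
    }
    ```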



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Next, I analyzed usage of constant pool entries and java bytecode ops using `janinoc`. The summary is as follows:
    ```
    array[4]     : 6 + 0 * n entries, 6-8 java bytecode ops / read access
    outerInstance: 3 + 3 * n entries, 5 java bytecode ops / read access
    innerInstance: 9 + 3 * n entries, 6 java bytecode ops / read access
    ```
    
    ```
    public class CP {
      int[] a = new int[1000000];
      int globalVar0;
      int globalVar1;
      private Inner inner = new Inner();
    
      private class Inner {
        int nestedVar0;
        int nestedVar1;
      }
    
      void access() {
        a[4] = 0;
        a[5] = 0;
    
        globalVar0 = 0;
        globalVar1 = 0;
    
        inner.nestedVar0 = 0;
        inner.nestedVar1 = 0;
      }
    
      static public void main(String[] argv) {
        CP cp = new CP();
        cp.access();
      }
    }
    ```
    
    Java bytecode
    ```
      void access();
        descriptor: ()V
        Code:
          stack=3, locals=1, args_size=1
             0: aload_0
             1: getfield      #12                 // Field a:[I
             4: iconst_4
             5: iconst_0
             6: iastore
             7: aload_0
             8: getfield      #12                 // Field a:[I
            11: iconst_5
            12: iconst_0
            13: iastore
    
            14: aload_0
            15: iconst_0
            16: putfield      #16                 // Field globalVar0:I
            19: aload_0
            20: iconst_0
            21: putfield      #19                 // Field globalVar1:I
    
            24: aload_0
            25: getfield      #23                 // Field inner:LCP$Inner;
            28: iconst_0
            29: putfield      #28                 // Field CP$Inner.nestedVar0:I
            32: aload_0
            33: getfield      #23                 // Field inner:LCP$Inner;
            36: iconst_0
            37: putfield      #31                 // Field CP$Inner.nestedVar1:I
            40: return
    ```
    
    Constant pool
    ```
       #1 = Utf8               CP
       #2 = Class              #1             // CP
       #9 = Utf8               a
      #10 = Utf8               [I
      #11 = NameAndType        #9:#10         // a:[I
      #12 = Fieldref           #2.#11         // CP.a:[I
    
      #13 = Utf8               globalVar0
      #14 = Utf8               I
      #15 = NameAndType        #13:#14        // globalVar0:I
      #16 = Fieldref           #2.#15         // CP.globalVar0:I
    
      #17 = Utf8               globalVar1
      #18 = NameAndType        #17:#14        // globalVar1:I
      #19 = Fieldref           #2.#18         // CP.globalVar1:I
    
      #20 = Utf8               inner
      #21 = Utf8               LCP$Inner;
      #22 = NameAndType        #20:#21        // inner:LCP$Inner;
      #23 = Fieldref           #2.#22         // CP.inner:LCP$Inner;
    
      #24 = Utf8               CP$Inner
      #25 = Class              #24            // CP$Inner
      #26 = Utf8               nestedVar0
      #27 = NameAndType        #26:#14        // nestedVar0:I
      #28 = Fieldref           #25.#27        // CP$Inner.nestedVar0:I
      
      #29 = Utf8               nestedVar1
      #30 = NameAndType        #29:#14        // nestedVar1:I
      #31 = Fieldref           #25.#30        // CP$Inner.nestedVar1:I
    
      #32 = Utf8               LineNumberTable
      #33 = Utf8               Code
      #34 = Utf8               main
      #35 = Utf8               ([Ljava/lang/String;)V
      #36 = Utf8               <init>
      #37 = NameAndType        #36:#8         // "<init>":()V
      #38 = Methodref          #2.#37         // CP."<init>":()V
      #39 = NameAndType        #7:#8          // access:()V
      #40 = Methodref          #2.#39         // CP.access:()V
      #41 = Methodref          #4.#37         // java/lang/Object."<init>":()V
      #42 = Integer            1000000
      #43 = Utf8               (LCP;)V
      #44 = NameAndType        #36:#43        // "<init>":(LCP;)V
      #45 = Methodref          #25.#44        // CP$Inner."<init>":(LCP;)V
      #46 = Utf8               Inner
      #47 = Utf8               InnerClasses
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan Is it better to use this PR? Or, create a new PR?


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19518#discussion_r150220400
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
    @@ -801,12 +801,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
       private[this] def castToByteCode(from: DataType, ctx: CodegenContext): CastFunction = from match {
         case StringType =>
           val wrapper = ctx.freshName("wrapper")
    -      ctx.addMutableState("UTF8String.IntWrapper", wrapper,
    +      val wrapperAccessor = ctx.addMutableState("UTF8String.IntWrapper", wrapper,
    --- End diff --
    
    I'd like to have something like
    ```
    val wrapper = ctx.addMutableState("UTF8String.IntWrapper", v => s"$v = new UTF8String.IntWrapper();")
    ```


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I'd prefer inner class approach.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    For the strategy, I'd like to give priority to primitive values being flat global variables. We also need to decide the priority among primitive types, according to which type has the largest performance difference between a flat global variable and an array, and which type is used more frequently (maybe boolean).


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @viirya if you take a look at the example I posted, you can see that we are not saving either `NameAndType` or `Fieldref` entries, thus I think the only solution we have found so far to save constant pool entries is to use arrays.
    
    What may be interesting, IMHO, is to evaluate where we are using a variable. Since when we have a lot of instance variables we are very likely to also have several inner classes (for splitting the methods), it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I also think this is not trivial to achieve. @kiszk what do you think?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard can you close this pr?


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard closed the pull request at:

    https://github.com/apache/spark/pull/19518


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard I remember that we had a similar discussion about benchmarking. Could you take a look at [this discussion](https://github.com/apache/spark/pull/16648#discussion_r118043056)?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk thanks for your great analysis. May I ask just a couple of additional questions?
     1 - In all your tests, which compiler are you using? I see that you are linking to the Oracle docs, so maybe you are using `javac` for your tests, but in my tests (made for other cases) I realized that `janinoc` works a bit differently, and what is true for `javac` may not be true for `janinoc`.
     2 - If the problem with arrays occurs when we go beyond 32767, what about creating many arrays with max size 32767 (see the sketch below)? I see that this is not a definitive solution and we still have some limitations, but dividing the number of constant pool entries by 32767 looks like a very good achievement to me.
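
    A rough, self-contained sketch of what that could look like (the array names and the chosen slots are hypothetical): each array is capped at 32767 elements, so every literal index fits in a `sipush` and needs no `ldc` constant pool entry.
    
    ```
    class SplitArrays {
      private boolean[] mutableStateArray_0 = new boolean[32767];  // logical slots 0..32766
      private boolean[] mutableStateArray_1 = new boolean[32767];  // logical slots 32767..65533
    
      void consume() {
        mutableStateArray_0[32766] = true;  // sipush 32766
        mutableStateArray_1[0] = true;      // logical slot 32767 -> array 1, offset 0
      }
    }
    ```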


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan I want to take this over if possible
    cc @maropu


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    You are comparing arrays vs member variables; can we compare arrays vs inner class member variables? Also, too many classes will have overhead on the classloader, so we should test some extreme cases, like 1 million variables.


---



[GitHub] spark pull request #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19518#discussion_r145213781
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,7 +908,10 @@ class CodegenContext {
             addNewFunction(name, code)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val exprs = transformFunctions(functions.map(name =>
    +        s"$name(${arguments.map(_._2).mkString(", ")})"))
    +
    +      splitExpressions(exprs, funcName, arguments)
    --- End diff --
    
    Changes I made here to `splitExpressions` were to handle instances where the split code method references were still over 64kb. It would seem this problem is addressed by @mgaido91 in #19480, and that implementation is much more thorough, so if that PR gets merged, I'd prefer to rebase against that.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    ping @cloud-fan


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @cloud-fan you are right, I am updating the benchmark program and results.
    I realized one issue with the array approach: **constant pool entries required by array access indices**
    
    When we use the array approach, a global variable will be accessed as `this.globalVar[12345]`. Here is the bytecode sequence below. Each access to an array element (with an index greater than 5, since `iconst_0` ... `iconst_5` do not use a constant pool entry) requires one constant pool entry.
    So while we save one constant pool entry for the global variable, we require one constant pool entry for the access.
    @bdrillard did your implementation (probably around [here](https://github.com/apache/spark/pull/19518/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR218)) avoid this issue?
    
    WDYT? cc @viirya @maropu 
    
    ```
    aload 0  // load this
    getfield [constant pool index] // load this.globalVar
    ldc [constant pool index] // load 12345 from constant pool and push it
    iaload
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard thanks!


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > When we use an inner class approach, we still require constant pool entries for accessing instance variables (e.g. `this.inner001.globalVar55555`) in one class.
    
    @kiszk But we can still save the `NameAndType` and field name entries for the global variable?



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard @cloud-fan @maropu
    I created and ran a benchmark program. I think that using an array for compaction is slower than using scalar instance variables. In my case below, it is 20% slower in the best time.
    
    Thus, I would like to use an approach that creates inner classes to keep scalar instance variables.
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && javac Test.java && java Test
    
    
    Result(us): Array
       0: 145333.227
       1: 144288.262
       2: 144233.871
       3: 144536.350
       4: 144503.269
       5: 144836.117
       6: 144448.053
       7: 144744.725
       8: 144688.652
       9: 144727.823
      10: 144447.789
      11: 144500.638
      12: 144641.592
      13: 144464.106
      14: 144518.914
      15: 144844.639
      16: 144780.464
      17: 144617.363
      18: 144463.271
      19: 144508.170
      20: 144929.451
      21: 144529.697
      22: 144273.167
      23: 144362.926
      24: 144296.854
      25: 144398.665
      26: 144490.813
      27: 144435.732
      28: 144675.997
      29: 144483.581
    BEST: 144233.871000, AVG: 144566.806
    
    Result(us): Vars
       0: 120375.384
       1: 119800.238
       2: 119822.842
       3: 119830.761
       4: 119836.781
       5: 120185.751
       6: 120208.140
       7: 120274.925
       8: 120112.109
       9: 120082.120
      10: 120063.456
      11: 120112.493
      12: 120144.937
      13: 119964.356
      14: 119941.633
      15: 119825.758
      16: 119677.506
      17: 119833.236
      18: 119749.781
      19: 119723.932
      20: 120197.394
      21: 120052.820
      22: 120006.650
      23: 119939.335
      24: 119857.469
      25: 120176.229
      26: 120153.605
      27: 120345.581
      28: 120163.129
      29: 120038.673
    BEST: 119677.506, AVG: 120016.567
    ```
    
    Small MyInstance.java (N = 16, M = 4)
    ```
    class MyInstance {
      final int N = 16;
      int[] instance = new int[N];
      void accessArrays00000() {
        instance[8] = instance[0];
        instance[9] = instance[1];
        instance[10] = instance[2];
        instance[11] = instance[3];
      }
      void accessArrays00001() {
        instance[12] = instance[4];
        instance[13] = instance[5];
        instance[14] = instance[6];
        instance[15] = instance[7];
      }
      void accessArrays00002() {
        instance[0] = instance[8];
        instance[1] = instance[9];
        instance[2] = instance[10];
        instance[3] = instance[11];
      }
      void accessArrays00003() {
        instance[4] = instance[12];
        instance[5] = instance[13];
        instance[6] = instance[14];
        instance[7] = instance[15];
      }
      void accessArray() {
        accessArrays00000();
        accessArrays00001();
        accessArrays00002();
        accessArrays00003();
      }
    
      int instance00000;
      int instance00001;
      int instance00002;
      int instance00003;
      int instance00004;
      int instance00005;
      int instance00006;
      int instance00007;
      int instance00008;
      int instance00009;
      int instance00010;
      int instance00011;
      int instance00012;
      int instance00013;
      int instance00014;
      int instance00015;
      void accessVars00000() {
        instance00008 = instance00000;
        instance00009 = instance00001;
        instance00010 = instance00002;
        instance00011 = instance00003;
      }
      void accessVars00001() {
        instance00012 = instance00004;
        instance00013 = instance00005;
        instance00014 = instance00006;
        instance00015 = instance00007;
      }
      void accessVars00002() {
        instance00000 = instance00008;
        instance00001 = instance00009;
        instance00002 = instance00010;
        instance00003 = instance00011;
      }
      void accessVars00003() {
        instance00004 = instance00012;
        instance00005 = instance00013;
        instance00006 = instance00014;
        instance00007 = instance00015;
      }
      void accessVars() {
        accessVars00000();
        accessVars00001();
        accessVars00002();
        accessVars00003();
      }
    }
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    let's create a new PR


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    This PR was addressed by #19811, closing this one.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Thanks for giving this the attention to shepherd it through. I haven't had the time to do the additional coding work necessary to properly benchmark it in the last few weeks. @kiszk, if there are any questions regarding my earlier implementation as you make/review the second PR, I'm happy to make clarifications and can respond to those in writing quickly.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    The hybrid approach sounds reasonable to me. Is there any special strategy to decide which fields are global variables and which go into the array?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk You are correct that the current implementation compacts all mutable state (where the state does not have to be _explicitly_ inlined).
    
    To your last question, I'd attempted some analysis of the JVM bytecode of array versus inlined state initialized either through method calls or in loops. I'd posted the experiment and results: https://github.com/bdrillard/bytecode-poc
    
    If Spark has its own benchmarking tools, I'd be happy to use those to compare Catalyst-generated classes further.
    
    To the general question of _when_ we compact state, I think some kind of threshold still makes sense. It would be best to ensure that the typical code path (for typical Dataset schemas) remains unaffected by the changes (as was the aim when generating nested classes in #18075).
    
    I've found that trying to set a global threshold for when to compact mutable state can be hard. Some state _has_ to be inlined (state that uses parameterized constructors that can't be easily initialized with loops, like the `BufferHolder` and `UnsafeRowWriter`). I've found situations where, due to code generator flow, we began by inlining an amount of state that _could have been_ compacted, then started compacting state after a set threshold, but then began inlining state again that _could not be_ compacted, forcing us over the constant pool limit.
    
    It's difficult to tell when a certain piece of state will be referenced frequently or infrequently. For example, we do know some pieces of primitive mutable state, like global booleans that are part of conditional checks, are initialized globally, assigned once in one method, and then referenced only once in a separate caller method. These are excellent candidates for compaction, since they proliferate very quickly and are, in a sense, "only used once" (declared, initialized, re-assigned in a method, accessed in another method, never used again). 
    
    Other pieces of state, like row objects, and JavaBean objects, will be accessed a number of times relative to how many fields they have, which isn't necessarily easy info to retrieve during code generation (we'd have to reflect or do inspection of the initialization code to know how many fields such an object has). But these items are probably still good candidates for compaction in general because of how many of a given type there could be. 
    
    I'm inclined to use a threshold against the name/types of the state, rather than a global threshold. Since `freshName` is always monotonically increasing from 1 for a given variable prefix, we could know when a threshold for state of that type was reached, and when we could begin compacting that type of state, independently/concurrently with the other types of state. Such a scheme would allow us to ensure the usual flow of code-generation remains as it is now, with no state-compaction for typical operations, and then with state-compaction in the more extreme cases that would threaten to blow the Constant Pool limit.
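
    A rough sketch of the per-prefix bookkeeping I have in mind (all names, the threshold, and the method signature are hypothetical, not the actual `CodeGenerator` API):
    
    ```
    import java.util.HashMap;
    import java.util.Map;
    
    class MutableStateBookkeeping {
      // below this many entries for a given prefix, state stays as flat inlined fields
      private static final int COMPACT_THRESHOLD = 100;
      private final Map<String, Integer> countsByPrefix = new HashMap<>();
    
      // Returns the code used to reference the state: a flat field name while the
      // prefix is below the threshold, an array slot once it has too many entries.
      String addMutableState(String javaType, String prefix) {
        int count = countsByPrefix.merge(prefix, 1, Integer::sum);
        if (count <= COMPACT_THRESHOLD) {
          return prefix + "_" + (count - 1);                                // e.g. "isNull_42"
        } else {
          return prefix + "Array[" + (count - COMPACT_THRESHOLD - 1) + "]"; // e.g. "isNullArray[3]"
        }
      }
    }
    ```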


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @kiszk I meant that `janinoc` creates a slightly different constant pool from `javac`. I am not sure about performance, but the number of constant pool entries is definitely different. For instance, let's take this example and analyze the `Outer` class constant pool:
    ```
    class Outer {
      private Inner innerInstance = new Inner();
      private void outerMethod(){
        innerInstance.b = 1;
        innerInstance.a = 1;
      }
      private boolean outerMethod2(){
      	return innerInstance.b > innerInstance.a;
      }
      private class Inner {
        int b =2;
        private int a = 0;
      }
    }
    ```
    if you compile it with `javac`, the constant pool will be:
    ```
    Constant pool:
       #1 = Methodref          #9.#25         // java/lang/Object."<init>":()V
       #2 = Class              #26            // Outer$Inner
       #3 = Methodref          #2.#27         // Outer$Inner."<init>":(LOuter;LOuter$1;)V
       #4 = Fieldref           #8.#28         // Outer.innerInstance:LOuter$Inner;
       #5 = Fieldref           #2.#29         // Outer$Inner.b:I
       #6 = Methodref          #2.#30         // Outer$Inner.access$102:(LOuter$Inner;I)I
       #7 = Methodref          #2.#31         // Outer$Inner.access$100:(LOuter$Inner;)I
       #8 = Class              #32            // Outer
       #9 = Class              #33            // java/lang/Object
      #10 = Class              #34            // Outer$1
      #11 = Utf8               InnerClasses
      #12 = Utf8               Inner
      #13 = Utf8               innerInstance
      #14 = Utf8               LOuter$Inner;
      #15 = Utf8               <init>
      #16 = Utf8               ()V
      #17 = Utf8               Code
      #18 = Utf8               LineNumberTable
      #19 = Utf8               outerMethod
      #20 = Utf8               outerMethod2
      #21 = Utf8               ()Z
      #22 = Utf8               StackMapTable
      #23 = Utf8               SourceFile
      #24 = Utf8               Outer.java
      #25 = NameAndType        #15:#16        // "<init>":()V
      #26 = Utf8               Outer$Inner
      #27 = NameAndType        #15:#35        // "<init>":(LOuter;LOuter$1;)V
      #28 = NameAndType        #13:#14        // innerInstance:LOuter$Inner;
      #29 = NameAndType        #36:#37        // b:I
      #30 = NameAndType        #38:#39        // access$102:(LOuter$Inner;I)I
      #31 = NameAndType        #40:#41        // access$100:(LOuter$Inner;)I
      #32 = Utf8               Outer
      #33 = Utf8               java/lang/Object
      #34 = Utf8               Outer$1
      #35 = Utf8               (LOuter;LOuter$1;)V
      #36 = Utf8               b
      #37 = Utf8               I
      #38 = Utf8               access$102
      #39 = Utf8               (LOuter$Inner;I)I
      #40 = Utf8               access$100
      #41 = Utf8               (LOuter$Inner;)I
    ```
    (please note that it creates a fake getter and a fake setter method entries for the `private` inner variable `a`). 
    
    If you compile the same class with `janinoc`, instead, the constant pool will be:
    ```
    Constant pool:
       #1 = Utf8               Outer
       #2 = Class              #1             // Outer
       #3 = Utf8               java/lang/Object
       #4 = Class              #3             // java/lang/Object
       #5 = Utf8               SourceFile
       #6 = Utf8               Outer.java
       #7 = Utf8               outerMethod$
       #8 = Utf8               (LOuter;)V
       #9 = Utf8               innerInstance
      #10 = Utf8               LOuter$Inner;
      #11 = NameAndType        #9:#10         // innerInstance:LOuter$Inner;
      #12 = Fieldref           #2.#11         // Outer.innerInstance:LOuter$Inner;
      #13 = Utf8               Outer$Inner
      #14 = Class              #13            // Outer$Inner
      #15 = Utf8               b
      #16 = Utf8               I
      #17 = NameAndType        #15:#16        // b:I
      #18 = Fieldref           #14.#17        // Outer$Inner.b:I
      #19 = Utf8               a
      #20 = NameAndType        #19:#16        // a:I
      #21 = Fieldref           #14.#20        // Outer$Inner.a:I
      #22 = Utf8               LineNumberTable
      #23 = Utf8               Code
      #24 = Utf8               outerMethod2$
      #25 = Utf8               (LOuter;)Z
      #26 = Utf8               <init>
      #27 = Utf8               ()V
      #28 = NameAndType        #26:#27        // "<init>":()V
      #29 = Methodref          #4.#28         // java/lang/Object."<init>":()V
      #30 = NameAndType        #26:#8         // "<init>":(LOuter;)V
      #31 = Methodref          #14.#30        // Outer$Inner."<init>":(LOuter;)V
      #32 = Utf8               Inner
      #33 = Utf8               InnerClasses
    ```
    (note that `a` now is considered as a regular field).
    
    Thus in all our tests we should use `janinoc` instead of `javac` to evaluate the behavior of the constant pool in the different cases.
    
    Let me know if you have any question or doubt. Thanks.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    The PR looks good.
    
    Although it can relieve the constant pool pressure, I have a question: does it mean that every time a constant in the short range is used in the code, we add one byte to the bytecode, because `sipush` is followed by a two-byte value? I'm afraid it may be negative for some cases, such as a Java program with many short constants.



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    In the int case, the following are the java bytecode lengths for loading a value:
    
    1 byte: -1, 0, 1, 2, 3, 4, 5 (iconst_?)
    2 bytes: -128 ~ -2, 6 ~ 127 (bipush)
    3 bytes: -32768 ~ -129, 128 ~ 32767 (sipush)
    4 or 5 bytes: others (ldc or ldc_w)



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Btw, can we config the maximum number of global variables?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Here is [a PR](https://github.com/janino-compiler/janino/pull/34) to fix a problem regarding `sipush`.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I created and ran another synthetic benchmark program comparing flat global variables, inner global variables, and arrays, this time using janinoc for the target Java file.
    The performance is not much different from the previous run. In summary, the following are the performance results (**smaller is better**).
    - 1: array
    - 0.90: inner global variables
    - 0.81: flat global variables
    
    WDYT? Any comments are very appreciated.
    
    Here are [Test.java](https://gist.github.com/kiszk/63c2829488cb777d7ca78d45d20c021f) and [myInstance.py](https://gist.github.com/kiszk/049a62f5d1259481c400a86299bd0228) that I used.
    
    ```
    $ cat /proc/cpuinfo | grep "model name" | uniq
    model name	: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
    $ java -version
    openjdk version "1.8.0_131"
    OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
    OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
    $ python myInstance.py > MyInstance.java && janinoc MyInstance.java  && javac Test.java && java -Xmx16g Test
    
    Result(us): Array
       0: 484251.446
       1: 483374.255
       2: 483956.692
       3: 482498.241
       4: 483602.261
       5: 482654.567
       6: 482896.671
       7: 483458.625
       8: 483194.317
       9: 483387.234
      10: 484103.729
      11: 483536.493
      12: 483790.828
      13: 483590.991
      14: 483993.488
      15: 483455.164
      16: 484040.009
      17: 483225.837
      18: 483126.520
      19: 484105.989
      20: 484988.935
      21: 483766.245
      22: 483667.930
      23: 483271.499
      24: 483071.606
      25: 483174.438
      26: 483602.474
      27: 483210.405
      28: 483907.061
      29: 483071.964
    BEST: 482498.241000, AVG: 483532.530
    
    Result(us): InnerVars
       0: 437016.533
       1: 436125.481
       2: 436360.534
       3: 435857.758
       4: 436166.243
       5: 437089.913
       6: 436168.359
       7: 435570.397
       8: 435550.848
       9: 435256.088
      10: 435252.679
      11: 435765.156
      12: 435646.739
      13: 437303.993
      14: 435315.530
      15: 435752.545
      16: 434857.606
      17: 436776.190
      18: 435444.877
      19: 435657.649
      20: 436248.147
      21: 436322.998
      22: 437214.262
      23: 435907.223
      24: 435431.025
      25: 435274.317
      26: 435412.202
      27: 435670.321
      28: 436494.045
      29: 436347.838
    BEST: 434857.606, AVG: 435975.250
    
    Result(us): Vars
       0: 353983.048
       1: 354067.690
       2: 353138.178
       3: 354093.115
       4: 354067.180
       5: 352750.571
       6: 353672.510
       7: 355179.115
       8: 353296.750
       9: 354522.113
      10: 355221.301
      11: 355178.172
      12: 353859.319
      13: 353539.817
      14: 352703.352
      15: 353923.981
      16: 354442.744
      17: 355523.145
      18: 354849.122
      19: 354082.888
      20: 354673.504
      21: 355526.218
      22: 355264.029
      23: 355455.492
      24: 355520.322
      25: 353923.520
      26: 353796.600
      27: 355021.849
      28: 355800.387
      29: 353810.567
    BEST: 352703.352, AVG: 354362.887
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > Each access to an array element requires one constant pool entry.
    
    what do you mean by this? using array can't reduce constant pool size at all?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thank you for your questions.
    1. I am using `javac` as shown. I am sorry that I cannot understand what you are pointing out. In this benchmark, what are the differences between `javac` and `janinoc`?
    2. I agree with you.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Good news: [the issue](https://github.com/janino-compiler/janino/issues/33) in janino has been quickly fixed. Bad news: there is no official date for releasing the next version yet.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    I like the latest hybrid idea from @kiszk in terms of performance and readability. Also, this is a corner case, so I don't want to affect most regular small queries.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    First of all, I have to share sad news with you. `janino` does not use `sipush` for values from 128 to 32767. The current `janino` 3.0.7 uses `iconst...`, `bipush`, or `ldc`. `javac` uses `sipush` for values from 128 to 32767. In other words, if an index is greater than 127, one constant pool entry is used by bytecode compiled with `janino`. It should be fixed.
    
    ```
    public class Array {
      int[] a = new int[1000000];
    
      void access() {
        a[5] = 0;
        a[6] = 0;
        a[127] = 0;
        a[128] = 0;
        a[1023] = 0;
        a[16383] = 0;
        a[32767] = 0;
        a[32768] = 0;
      }
    
      static public void main(String[] argv) {
        Array a = new Array();
        a.access();
      }
    }
    ```
    
    ```
      void access();
        descriptor: ()V
        flags:
        Code:
          stack=3, locals=1, args_size=1
             0: aload_0
             1: getfield      #12                 // Field a:[I
             4: iconst_5
             5: iconst_0
             6: iastore
             7: aload_0
             8: getfield      #12                 // Field a:[I
            11: bipush        6
            13: iconst_0
            14: iastore
            15: aload_0
            16: getfield      #12                 // Field a:[I
            19: bipush        127
            21: iconst_0
            22: iastore
            23: aload_0
            24: getfield      #12                 // Field a:[I
            27: ldc           #13                 // int 128
            29: iconst_0
            30: iastore
            31: aload_0
            32: getfield      #12                 // Field a:[I
            35: ldc           #14                 // int 1023
            37: iconst_0
            38: iastore
            39: aload_0
            40: getfield      #12                 // Field a:[I
            43: ldc           #15                 // int 16383
            45: iconst_0
            46: iastore
            47: aload_0
            48: getfield      #12                 // Field a:[I
            51: ldc           #16                 // int 32767
            53: iconst_0
            54: iastore
            55: aload_0
            56: getfield      #12                 // Field a:[I
            59: ldc           #17                 // int 32768
            61: iconst_0
            62: iastore
            63: return
    
    Constant pool:
       #1 = Utf8               Array
       #2 = Class              #1             // Array
       #9 = Utf8               a
      #10 = Utf8               [I
      #11 = NameAndType        #9:#10         // a:[I
      #12 = Fieldref           #2.#11         // Array.a:[I
    
      #13 = Integer            128
      #14 = Integer            1023
      #15 = Integer            16383
      #16 = Integer            32767
      #17 = Integer            32768
    ```



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thank you very much. I did not know about that difference. I will validate with `janinoc` 3.0.0.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    In #19811, as a first step, I will take a hybrid approach of outer (flat) global variables and arrays. The threshold for the number of global variables would be configurable.
    
    I will give high priority to primitive variables, placing them in the outer class for performance.
    
    > I think it would be great if we were able to declare variables which are used only in an inner class in that inner class. Unfortunately, I think also that this is not trivial to achieve at all.
    
    It would be great if we could do this. For now, #19811 will not address it.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
     @kiszk @maropu any of you wanna take this over? This patch becomes important as we now split code more aggressively.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @bdrillard since my PR and others got merged, there are now some conflicts; could you please fix them? Thanks.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    @mgaido91 Thanks. I looked at the constant pool you posted. It's clear.
    
    Is there any benefit to declaring the variables in the inner classes? It looks like they still occupy constant pool entries?
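    
    For what it's worth, here is a minimal, hypothetical sketch of the point behind the question (class and field names are illustrative, not from this PR): every class file carries its own constant pool, so a field that is declared and used only inside a nested class consumes entries in the nested class's pool, while the outer class only pays for the entries needed to reference the nested class itself.
    
    ```
    // Hypothetical sketch; names are illustrative, not generated by this PR.
    public class OuterClass {
      // Creating the nested instance needs Class/Methodref entries in OuterClass's pool.
      private final NestedState state = new NestedState();
    
      public void apply() {
        state.compute();  // one Methodref entry in OuterClass's pool
      }
    
      static class NestedState {
        int counter;  // the field's name/descriptor entries live in NestedState's pool
    
        void compute() {
          counter += 1;  // the Fieldref here is resolved against NestedState's own pool
        }
      }
    }
    ```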


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    > I think ldc is 2 bytes and ldc_w is 3 bytes?
    You are right, thanks, updated.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    ping @bdrillard 


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    OK, I will create a new PR


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    >  4 or 5 bytes: others (ldc or ldc_w)
    
    I think ldc is 2 bytes and ldc_w is 3 bytes?


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    yea, ok @kiszk I'll review your work.


---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Thank you for creating a PR for the latest Spark.
    
    I think that it is great to reduce the number of constant pool entries. I have one high-level comment.  
    IIUC, this PR **always** performs mutable state compaction. In other words, mutable states are always stored in arrays.  
    I am afraid of possible performance degradation due to the increased access cost of putting states in arrays.
    
    What do you think about putting mutable states into arrays (i.e. performing mutable state compaction) only when there are many mutable states, or only for certain mutable states that are rarely accessed?  
    Or, can we say there is no performance degradation due to mutable state compaction?
    
    What do you think?
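    
    To make the access-cost concern concrete, below is a minimal sketch (hypothetical names, not code generated by this PR) of the two shapes of generated code being compared. The flat form writes a field with a single `putfield`, while the compacted form loads the array field, pushes an index, and stores into the slot, with an array bounds check at runtime:
    
    ```
    // Hypothetical sketch of flat vs. compacted mutable state access.
    public class GeneratedIterator {
      // Flat (inlined) state: one field per entry, one constant pool Fieldref each.
      private int value_1;
    
      // Compacted state: many entries share a single array field.
      private int[] mutableStateArray = new int[4];
    
      public void applyFlat() {
        value_1 = 1;               // aload_0; iconst_1; putfield
      }
    
      public void applyCompacted() {
        mutableStateArray[3] = 1;  // aload_0; getfield; iconst_3; iconst_1; iastore (+ bounds check)
      }
    }
    ```
    
    Whether those extra loads and bounds checks are visible in end-to-end benchmarks is exactly the open question raised here.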



---



[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19518
  
    Based on the performance results and constant pool entry usage, I would like to use a hybrid approach with flat global variables and arrays.  
    For example, the first 500 variables are stored as flat global variables, and the rest are stored in arrays of 32767 elements, as sketched below. I think that most non-extreme cases can then enjoy simple code without array accesses, and good performance.
    
    WDYT?
    
    ```
    class Foo {
      int globalVars1;
      int globalVars2;
      ...
      int globalVars499;
      int globalVars500;
      int[] globalArrays1 = new int[32767];
      int[] globalArrays2 = new int[32767];
      int[] globalArrays3 = new int[32767];
      ...
    
      void apply1(InternalRow i) {
        globalVars1 = 1;
        globalVars2 = 1;
        ...
        globalVars499 = 1;
        globalVars500 = 1;
      }
    
      void apply2(InternalRow i) {
        globalArrays1[0] = 1;
        globalArrays1[1] = 1;
        ...
      }
    
      void apply(InternalRow i) {
        apply0(i);
        apply1(i);
        apply2(i);
        ...
      }
    }
    ```


---
