Posted to reviews@spark.apache.org by bdrillard <gi...@git.apache.org> on 2017/05/23 20:41:35 UTC

[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

GitHub user bdrillard opened a pull request:

    https://github.com/apache/spark/pull/18075

    [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - Class Splitting

    ## What changes were proposed in this pull request?
    
    This pull request includes only the class splitting feature described in #16648. When the code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated, into which subsequent functions are inlined. Additional sub-classes are generated each time the threshold is reached again. This PR makes three changes:
    
    1. Adds helper maps, lists, and functions (in the `CodeGenerator` class) for keeping track of sub-classes during code generation. These helpers allow nested classes and split functions to be initialized, declared, and inlined at the appropriate locations in the various projection classes.
    2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used.
    3. Removes instances of the `this` keyword when used on state inside generated classes. All state declared in the outer class is global by default and accessible to the nested classes. However, if a reference to global state in a nested class were prefixed with the `this` keyword, it would attempt to reference state belonging to the nested class (which does not exist), rather than the correct variable belonging to the outer class.
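
    The bookkeeping described in the three changes above can be sketched in miniature. The following Java model is purely illustrative (the names `CodeSplitter`, `NestedClass0`, and the `...Instance` qualifier are assumptions for the sketch; the real logic lives in Scala in `CodeGenerator`):

    ```java
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative model of the splitting bookkeeping: track each generated
    // class and its accumulated code size; once the current class passes the
    // 1600k-byte threshold, start a new nested class and return class-qualified
    // names for functions inlined there.
    class CodeSplitter {
        private static final int THRESHOLD = 1600000;
        private final Deque<String> classes = new ArrayDeque<>(); // most recent first
        private final Map<String, Integer> classSize = new HashMap<>();
        private int nextId = 0;

        CodeSplitter() {
            classes.addFirst("OuterClass");
            classSize.put("OuterClass", 0);
        }

        /** Returns the name callers must use to invoke the function: the simple
         *  name for the outer class, a qualified name for a nested sub-class. */
        String addNewFunction(String funcName, String funcCode) {
            String current = classes.peekFirst();
            if (classSize.get(current) > THRESHOLD) {
                current = "NestedClass" + nextId++;
                classes.addFirst(current);
                classSize.put(current, 0);
            }
            classSize.merge(current, funcCode.length(), Integer::sum);
            return current.equals("OuterClass")
                    ? funcName
                    : current + "Instance." + funcName; // e.g. "NestedClass0Instance.f"
        }
    }
    ```

    A generated call site then uses whatever name `addNewFunction` returned, which is why change 2 threads the returned name through the codebase.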
    
    ## How was this patch tested?
    
    Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool. 
    
    Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. A second PR may be opened to resolve that issue. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdrillard/spark class_splitting_only

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18075.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18075
    
----
commit a73fdded9ea15bbe66f5f86ca158bb76cdf79033
Author: ALeksander Eskilson <al...@cerner.com>
Date:   2017-05-23T19:41:21Z

    class_splitting_only adding class splitting

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121836076
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val (className, classInstance) = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +
    +    classSize(className) += funcCode.length
    +    classFunctions(className) += funcName -> funcCode
    +
    +    if (className.equals("OuterClass")) {
    --- End diff --
    
    nit: `className == "OuterClass"`




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122016650
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -629,7 +730,9 @@ class CodegenContext {
     
       /**
        * Splits the generated code of expressions into multiple functions, because function has
    -   * 64kb code size limit in JVM
    +   * 64kb code size limit in JVM. If the class the function is to be inlined to would grow beyond
    --- End diff --
    
    I think this is the grammatically correct and hopefully clearer form of that same docstring: https://github.com/apache/spark/pull/18075/commits/678b4ad770cbc891864f73d9b93c51b1ab79f6a5#diff-8bcc5aea39c73d4bf38aef6f6951d42cR727




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122356982
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,118 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  val outerClassName = "OuterClass"
    +
    +  /**
    +   * Holds the class and instance names to be generated, where `OuterClass` is a placeholder
    +   * standing for whichever class is generated as the outermost class and which will contain any
    +   * nested sub-classes. All other classes and instance names in this list will represent private,
    +   * nested sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](outerClassName -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](outerClassName -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(outerClassName -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    --- End diff --
    
    ditto.




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #78059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78059/testReport)** for PR 18075 at commit [`678b4ad`](https://github.com/apache/spark/commit/678b4ad770cbc891864f73d9b93c51b1ab79f6a5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #77564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77564/testReport)** for PR 18075 at commit [`493113c`](https://github.com/apache/spark/commit/493113ce2e1271039701be35b2603271282111df).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118345866
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    Yeah, good point: since we're using a mutable buffer, we can update the referenced object directly even if it's contained inside the map. Since `+=` explicitly returns the reference to the modified buffer, it would probably be most straightforward to use
    `classFunctions(name).append(funcCode)`
    since `append` has a `Unit` return type, and we don't need any result from appending the code to the class's buffer of function code.
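
    The point about the buffer can be shown in miniature. Below is a hypothetical Java analogue of the `classFunctions` map (names illustrative): the map holds a mutable collection by reference, so appending mutates it in place with no re-insertion needed.

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class BufferInMap {
        public static void main(String[] args) {
            // The map stores a *reference* to a mutable list, so mutating the
            // list in place is visible through the map without putting the
            // entry back.
            Map<String, List<String>> classFunctions = new HashMap<>();
            classFunctions.put("OuterClass", new ArrayList<>());

            // Equivalent of `classFunctions(name).append(funcCode)`:
            classFunctions.get("OuterClass").add("private void f() { }");

            System.out.println(classFunctions.get("OuterClass").size()); // prints 1
        }
    }
    ```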




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #78059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78059/testReport)** for PR 18075 at commit [`678b4ad`](https://github.com/apache/spark/commit/678b4ad770cbc891864f73d9b93c51b1ab79f6a5).




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #77597 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77597/testReport)** for PR 18075 at commit [`7fe5e4a`](https://github.com/apache/spark/commit/7fe5e4a84d4d8e71e2e63e6794e4ba13ac2e003f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121837792
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -93,7 +93,7 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
         }
     
         val nextBatch = ctx.freshName("nextBatch")
    --- End diff --
    
    actually, how about we make `addNewFunction` accept a `funcNameHint` and generate a unique function name inside `addNewFunction`? Then users can just call: `val nextBatch = ctx.addNewFunction("nextBatch", ...)`




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121837371
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -83,6 +83,58 @@ class GeneratedProjectionSuite extends SparkFunSuite {
         assert(result === row2)
       }
     
    +  test("SPARK-18016: generated projections on wider table requiring class-splitting") {
    +    val N = 4000
    +    val wideRow1 = new GenericInternalRow((1 to N).toArray[Any])
    +    val schema1 = StructType((1 to N).map(i => StructField("", IntegerType)))
    +    val wideRow2 = new GenericInternalRow(
    +      (1 to N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
    +    val schema2 = StructType((1 to N).map(i => StructField("", StringType)))
    +    val joined = new JoinedRow(wideRow1, wideRow2)
    +    val joinedSchema = StructType(schema1 ++ schema2)
    +    val nested = new JoinedRow(InternalRow(joined, joined), joined)
    +    val nestedSchema = StructType(
    +      Seq(StructField("", joinedSchema), StructField("", joinedSchema)) ++ joinedSchema)
    +
    +    // test generated UnsafeProjection
    +    val unsafeProj = UnsafeProjection.create(nestedSchema)
    +    val unsafe: UnsafeRow = unsafeProj(nested)
    +    (0 until N).foreach { i =>
    +      val s = UTF8String.fromString((i + 1).toString)
    --- End diff --
    
    create input data with `0 until N`, or test it with `1 to N`, to avoid this `i + 1`
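
    The mismatch being flagged is between Scala's inclusive `1 to N` (used to build the input data) and the exclusive `0 until N` (used to iterate), which forces the `i + 1` adjustment. A quick sketch of the same mismatch, using Java's `IntStream` as a stand-in for Scala ranges:

    ```java
    import java.util.stream.IntStream;

    class Ranges {
        public static void main(String[] args) {
            int n = 5;
            // Scala's `1 to N` is inclusive on both ends; `0 until N` excludes N.
            // The Java analogues are rangeClosed(1, n) and range(0, n).
            int[] data = IntStream.rangeClosed(1, n).toArray(); // {1, 2, 3, 4, 5}

            // Iterating with range(0, n) over data built from 1..n forces the
            // `i + 1` offset:
            IntStream.range(0, n).forEach(i -> { assert data[i] == i + 1; });

            // Building the data with range(0, n) as well removes the offset:
            int[] aligned = IntStream.range(0, n).toArray();
            IntStream.range(0, n).forEach(i -> { assert aligned[i] == i; });
        }
    }
    ```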




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121989572
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -93,7 +93,7 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
         }
     
         val nextBatch = ctx.freshName("nextBatch")
    --- End diff --
    
    I suppose it depends on which implementation we think is cleaner. The fresh name generated by the caller is typically used twice: once in the call to `addNewFunction`, but also immediately in the function code as the method name. If we use a name hint, we'd have to do a string `replace` inside `addNewFunction` to swap the placeholder method name for the fresh name. So it would seem either we keep
    
    ```
        val nextBatch = ctx.freshName("nextBatch")
        val nextBatchFuncName = ctx.addNewFunction(nextBatch,
          s"""
             |private void $nextBatch() throws java.io.IOException {
             |  long getBatchStart = System.nanoTime();
             |  if ($input.hasNext()) {
             |    $batch = ($columnarBatchClz)$input.next();
             |    $numOutputRows.add($batch.numRows());
             |    $idx = 0;
             |    ${columnAssigns.mkString("", "\n", "\n")}
             |  }
             |  $scanTimeTotalNs += System.nanoTime() - getBatchStart;
             |}""".stripMargin)
    ```
    
    or we have
    
    ```
        val nextBatchHint = "nextBatch"
        val nextBatch = ctx.addNewFunction(nextBatchHint,
          s"""
             |private void $nextBatchHint() throws java.io.IOException {
             |  long getBatchStart = System.nanoTime();
             |  if ($input.hasNext()) {
             |    $batch = ($columnarBatchClz)$input.next();
             |    $numOutputRows.add($batch.numRows());
             |    $idx = 0;
             |    ${columnAssigns.mkString("", "\n", "\n")}
             |  }
             |  $scanTimeTotalNs += System.nanoTime() - getBatchStart;
             |}""".stripMargin)
    ```
    
    where `addNewFunction` would do the proper replacement over the code for the method, with a fresh name generated from "nextBatch" as a name hint.
    
    Or in every instance, we just duplicate the string hint without creating a variable for it in both the `addNewFunction` call and the method name:
    
    ```
        val nextBatch = ctx.addNewFunction("nextBatch",
          s"""
             |private void nextBatch() throws java.io.IOException {
             |  long getBatchStart = System.nanoTime();
             |  if ($input.hasNext()) {
             |    $batch = ($columnarBatchClz)$input.next();
             |    $numOutputRows.add($batch.numRows());
             |    $idx = 0;
             |    ${columnAssigns.mkString("", "\n", "\n")}
             |  }
             |  $scanTimeTotalNs += System.nanoTime() - getBatchStart;
             |}""".stripMargin)
    ```
    
    Which would you prefer? 




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122016978
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -83,6 +83,58 @@ class GeneratedProjectionSuite extends SparkFunSuite {
         assert(result === row2)
       }
     
    +  test("SPARK-18016: generated projections on wider table requiring class-splitting") {
    +    val N = 4000
    +    val wideRow1 = new GenericInternalRow((1 to N).toArray[Any])
    --- End diff --
    
    I've cleaned up the test you comment on, and the one above it, since they both have the same structure, just different values for N:
    https://github.com/apache/spark/pull/18075/commits/678b4ad770cbc891864f73d9b93c51b1ab79f6a5#diff-a14107cf4a4c41671bba24a82f6042d9R36




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77597/
    Test PASSed.




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by fbertsch <gi...@git.apache.org>.
Github user fbertsch commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    We're really looking forward to this change! This bug is limiting a lot of the work we'd like to do with Spark. Any idea who we can ping to move this along?




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    The earlier failure occurred when the [`stopEarly()`](https://github.com/bdrillard/spark/blob/7fe5e4a84d4d8e71e2e63e6794e4ba13ac2e003f/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L73) function is registered to the `OuterClass` potentially twice. Using a [Map of functions](https://github.com/apache/spark/pull/18075/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR239) holding the function code _and_ its name fixes the issue, as whenever a function of the same name would be added more than once, it updates the older value. The tests pass after the change.
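    A minimal sketch of the deduplication idea described here (names like `register` and the `stopEarly` bodies are hypothetical, not the actual Spark code): keying functions by name means a second registration under the same name overwrites the first instead of producing a duplicate declaration in the generated class.

```scala
import scala.collection.mutable

// Nested maps: class name -> (function name -> function code).
val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
  mutable.Map("OuterClass" -> mutable.Map.empty[String, String])

// Hypothetical helper: registering under an existing name replaces the
// older code, so the generated class never declares the function twice.
def register(funcName: String, funcCode: String): Unit = {
  classFunctions("OuterClass") += funcName -> funcCode
}

register("stopEarly", "protected boolean stopEarly() { return false; }")
register("stopEarly", "protected boolean stopEarly() { return stopEarly_0; }")

// Only one declaration survives, holding the most recent body.
assert(classFunctions("OuterClass").size == 1)
```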




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118263506
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -792,7 +887,18 @@ class CodegenContext {
           addMutableState(javaType(expr.dataType), value,
             s"$value = ${defaultValue(expr.dataType)};")
     
    -      subexprFunctions += s"$fnName($INPUT_ROW);"
    +      // Generate the code for this expression tree and wrap it in a function.
    --- End diff --
    
    In the original pull-request, since `addMutableState` returned values that were going to be referenced in `fn`, I had to declare the mutable state accessors before the code. It's less important here. The really important part for this pull-request is that the return value of `addNewFunction(fnName, fn)` is what gets added to `subexprFunctions`.
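    A toy sketch of why the returned name matters (the stand-in `addNewFunction` and all names here are hypothetical): the call recorded in `subexprFunctions` must use whatever name `addNewFunction` hands back, because the function body may have been inlined to a nested-class instance and is then only reachable through a class-qualified name.

```scala
import scala.collection.mutable

// Hypothetical stand-in: pretend the outer class is already full, so the
// function lands on a nested-class instance and the qualified name is
// returned instead of the bare funcName.
def addNewFunction(funcName: String, funcCode: String): String =
  s"nestedClassInstance1.$funcName"

val subexprFunctions = mutable.ArrayBuffer.empty[String]
val fnName = addNewFunction("evalExpr1",
  "private void evalExpr1(InternalRow i) { /* ... */ }")

// Record the call using the returned (possibly qualified) name.
subexprFunctions += s"$fnName(i);"

assert(subexprFunctions.head == "nestedClassInstance1.evalExpr1(i);")
```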




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    thanks, merging to master! you can address the remaining comments in your other PRs




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121837140
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -83,6 +83,58 @@ class GeneratedProjectionSuite extends SparkFunSuite {
         assert(result === row2)
       }
     
    +  test("SPARK-18016: generated projections on wider table requiring class-splitting") {
    +    val N = 4000
    +    val wideRow1 = new GenericInternalRow((1 to N).toArray[Any])
    --- End diff --
    
    nit: `new GenericInternalRow(N)`




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r119180019
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---
    @@ -190,15 +190,15 @@ case class Stack(children: Seq[Expression]) extends Generator {
             if (index < values.length) values(index) else Literal(null, dataTypes(col))
           }
           val eval = CreateStruct(fields).genCode(ctx)
    -      s"${eval.code}\nthis.$rowData[$row] = ${eval.value};"
    +      s"${eval.code}\n$rowData[$row] = ${eval.value};"
         })
     
         // Create the collection.
         val wrapperClass = classOf[mutable.WrappedArray[_]].getName
         ctx.addMutableState(
           s"$wrapperClass<InternalRow>",
           ev.value,
    -      s"this.${ev.value} = $wrapperClass$$.MODULE$$.make(this.$rowData);")
    +      s"this.${ev.value} = $wrapperClass$$.MODULE$$.make($rowData);")
    --- End diff --
    
    Good catch, fixed: https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-16493d6958b6daaf4a24dd7b780ba4bcL201
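    A minimal Scala analogue of the scoping issue behind this change (class and field names hypothetical): unqualified references inside a nested class resolve to the enclosing instance's state, but prefixing `this.` restricts lookup to the nested class itself, which owns no such field.

```scala
class OuterClass {
  // State declared on the outer class, visible to nested classes.
  var rowData: Array[Int] = Array(1, 2, 3)

  class NestedClass {
    // Resolves to OuterClass.rowData, since NestedClass declares none.
    def total(): Int = rowData.sum

    // Would not compile: `this` is the NestedClass instance, which has
    // no member named rowData.
    // def broken(): Int = this.rowData.sum
  }

  def run(): Int = new NestedClass().total()
}

assert(new OuterClass().run() == 6)
```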




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118336805
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
     +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    Ah, sure. The story seems pretty much the same: we still want the append operation that also returns a reference to the modified buffer, and that's what `+=` gives us. Also, `+` isn't defined as an operation on a `ListBuffer[A]` and an element of type `A` (see [ListBuffer](https://www.scala-lang.org/api/2.11.8/index.html#scala.collection.mutable.ListBuffer)).
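    For reference, a small demonstration of the `+=` behavior being discussed: it appends in place and returns the same buffer instance, which is why it fits in a position where the updated collection is needed.

```scala
import scala.collection.mutable

// `+=` mutates the buffer and returns a reference to that same buffer.
val funcs = mutable.ListBuffer.empty[String]
val returned = funcs += "void f() {}"

assert(returned eq funcs)  // same instance, not a copy
assert(funcs == mutable.ListBuffer("void f() {}"))
```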




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118302508
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
     +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
       }
     
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline ot the OuterClass
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/78bccda7394db04f26ff2e8c14a2e2b5d8ea57cc#diff-8bcc5aea39c73d4bf38aef6f6951d42cL306




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Thank you. Absolutely, it is easier to review this change.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118872149
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
     +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions(name).append(funcCode)
    --- End diff --
    
    How about:
    
    ```scala
        classSize(name) += funcCode.length
        classFunctions(name) += funcCode
    ```
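    The suggested shorthand works because, for a `mutable.Map`, `classSize(name) += n` desugars to an `update` call; a small sketch (values hypothetical):

```scala
import scala.collection.mutable

// classSize(name) += n is sugar for
// classSize.update(name, classSize(name) + n)
val classSize = mutable.Map[String, Int]("OuterClass" -> 0)
classSize("OuterClass") += 42

assert(classSize("OuterClass") == 42)
```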





[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118300347
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -299,6 +297,9 @@ case class SampleExec(
               | }
              """.stripMargin.trim)
     
    +      ctx.addMutableState(s"$samplerClass<UnsafeRow>", sampler,
    --- End diff --
    
    Can we put this block at the original place?




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18075




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122016536
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
     +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val (className, classInstance) = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +
    +    classSize(className) += funcCode.length
    +    classFunctions(className) += funcName -> funcCode
    +
    +    if (className.equals("OuterClass")) {
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/678b4ad770cbc891864f73d9b93c51b1ab79f6a5#diff-8bcc5aea39c73d4bf38aef6f6951d42cR296




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77564/
    Test FAILed.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122016799
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
     +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
     +   *                           can be necessary when a function is declared outside of the context
     +   *                           in which it is eventually referenced, and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val (className, classInstance) = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +
    +    classSize(className) += funcCode.length
    +    classFunctions(className) += funcName -> funcCode
    +
    +    if (className.equals("OuterClass")) {
    +      funcName
     +    } else {
     +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline to the OuterClass.
    +    classes.map {
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/678b4ad770cbc891864f73d9b93c51b1ab79f6a5#diff-8bcc5aea39c73d4bf38aef6f6951d42cR327



[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    The second part that follows this merged PR is up as #19518.



[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Thanks, sounds good to me for now.
    cc @ueshin 



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121975114
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
     +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    --- End diff --
    
    Yes, see the portion of `doConsume` in the Limit class where the `stopEarly` function is registered, https://github.com/apache/spark/pull/18075/files#diff-379cccace8699ca00b76ff5631222adeR73
    
    In this section of code, the registration of the function is separate from the caller code, so unlike other changes in this patch, we have no way of informing the caller code what the potentially class-qualified name of the function would be if it were inlined to a nested class. Instead, the caller code for the function, which is generated some point later, makes a hard assumption that `stopEarly` will be visible globally, that is, in the outer class.
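    
    The visibility constraint described here can be illustrated with a hand-written Java analog of the generated-code shape (all names are hypothetical, not the PR's actual output): mutable state lives only on the outer class, nested private sub-classes hold only methods that read that state, and a caller that knows only the simple name must assume the method lives on the outer class.

```java
// Hand-written analog of the generated-code shape under discussion.
// State lives on the outer class; nested sub-classes hold only methods.
public class OuterClass {
    // Mutable state is declared only on the outer class.
    private int partitionIndex = 0;
    private boolean stopEarly = false;

    // A function inlined to the outer class: callable by its simple name,
    // which is why `stopEarly`-style helpers must stay here.
    private boolean shouldStop() {
        return stopEarly;
    }

    // A nested sub-class, as generated once the outer class grows too
    // large. It has no fields of its own; it reads the outer class' state.
    private class NestedClass1 {
        private int addOffset(int value) {
            // No `this.` prefix: `this.partitionIndex` would be resolved
            // against NestedClass1, where no such field exists.
            return value + partitionIndex;
        }
    }

    private NestedClass1 nestedClassInstance1 = new NestedClass1();

    public int process(int value) {
        if (shouldStop()) {                            // simple name
            return value;
        }
        return nestedClassInstance1.addOffset(value);  // class-qualified
    }

    public static void main(String[] args) {
        OuterClass o = new OuterClass();
        o.partitionIndex = 10;
        System.out.println(o.process(5)); // prints 15
    }
}
```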



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118168255
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,129 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding all functions and their names that will be inlined to a given class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(("OuterClass", mutable.Map.empty[String, String]))
    +
     +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
     +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcName -> funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
     +    // mutable state), so we declare and initialize them inline to the OuterClass.
    +    classes.map {
    +      case (className, classInstance) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          s"private $className $classInstance = new $className();"
    +        }
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all functions that should be inlined to the `OuterClass`
    +   */
    +  private[sql] def declareAddedFunctions(): String = {
    +    classFunctions("OuterClass").map {
    +      case (funcName, funcCode) => funcCode
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all nested, private sub-classes and functions that should be inlined to them.
    +   */
    +  private[sql] def declareNestedClasses(): String = {
    +    classFunctions.map {
    +      case (className, functions) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          val code = functions.map {
    +            case (_, funcCode) =>
    +              s"$funcCode"
    --- End diff --
    
    nit: s"$funcCode" -> funcCode



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121835500
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    --- End diff --
    
    we can create a variable for this `"OuterClass"` instead of hardcoding it in many places.
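    
    A minimal Java sketch of the suggested refactor (class and method names hypothetical): hoist the repeated `"OuterClass"` literal into one named constant so every map initializer and comparison references a single definition.

```java
// Hypothetical sketch of the suggestion above: one named constant for the
// outer-class placeholder instead of a string literal repeated throughout.
public class CodegenNames {
    static final String OUTER_CLASS = "OuterClass";

    // Every "is this the outer class?" check goes through one definition,
    // so renaming the placeholder later is a one-line change.
    static boolean isOuterClass(String className) {
        return OUTER_CLASS.equals(className);
    }
}
```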



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118320606
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
     +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    I am not 100% sure, but is " += " required? How about " = "?



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118168228
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,129 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding all functions and their names that will be inlined to a given class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(("OuterClass", mutable.Map.empty[String, String]))
    +
     +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
     +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
     +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcName -> funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
     +    // mutable state), so we declare and initialize them inline to the OuterClass.
    +    classes.map {
    +      case (className, classInstance) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          s"private $className $classInstance = new $className();"
    +        }
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all functions that should be inlined to the `OuterClass`
    +   */
    +  private[sql] def declareAddedFunctions(): String = {
    +    classFunctions("OuterClass").map {
    +      case (funcName, funcCode) => funcCode
    --- End diff --
    
    nit: funcName -> _



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118866774
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    --- End diff --
    
    nit: How about `classes.prepend((className, classInstance))` or `classes.prepend(className -> classInstance)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118344615
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    Thank you for your clarification. 
    In that case, is `classFunctions(name) += funcCode` enough instead of calling `classFunctions.update`?
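    The reviewer's point can be verified in isolation: because the map's values are themselves mutable `ListBuffer`s, `+=` mutates the buffer in place and the surrounding `Map.update` call adds nothing. A minimal sketch:

    ```scala
    import scala.collection.mutable

    object InPlaceBufferUpdate {
      val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
        mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])

      // The value is a mutable buffer, so += mutates it in place; wrapping
      // the mutation in classFunctions.update(...) would be redundant.
      classFunctions("OuterClass") += "private void f() { }"

      def main(args: Array[String]): Unit = {
        assert(classFunctions("OuterClass").toList == List("private void f() { }"))
      }
    }
    ```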




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118331100
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    Sorry for the confusion, I made a mistake in my comment. I meant " + " instead of " = ". 




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118274094
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,129 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding all functions and their names that will be inlined to a given class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(("OuterClass", mutable.Map.empty[String, String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcName -> funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline to the OuterClass
    +    classes.map {
    +      case (className, classInstance) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          s"private $className $classInstance = new $className();"
    +        }
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all functions that should be inlined to the `OuterClass`
    +   */
    +  private[sql] def declareAddedFunctions(): String = {
    +    classFunctions("OuterClass").map {
    +      case (funcName, funcCode) => funcCode
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/90a907acab6a074283908c0a3784af2c7d32cbce#diff-8bcc5aea39c73d4bf38aef6f6951d42cL322
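    The splitting logic quoted above can be condensed into a runnable sketch. This is not the actual Spark implementation: the threshold is shrunk to 40 characters for demonstration (the real code uses 1600000 bytes as a proxy for the 65,536-entry constant pool limit), and the class/instance names are illustrative:

    ```scala
    import scala.collection.mutable

    object ClassSplitter {
      // Toy threshold; the real code uses 1600000 bytes.
      private val Threshold = 40
      private val classes = mutable.ListBuffer[(String, String)]("OuterClass" -> "")
      private val classSize = mutable.Map[String, Int]("OuterClass" -> 0)
      private var nestedCount = 0

      def addNewFunction(funcName: String, funcCode: String): String = {
        // When the current class grows past the threshold, start a new
        // nested class and make it current by prepending it.
        if (classSize(classes.head._1) > Threshold) {
          nestedCount += 1
          val className = s"NestedClass$nestedCount"
          classes.prepend(className -> s"nestedClassInstance$nestedCount")
          classSize += className -> 0
        }
        val (className, instance) = classes.head
        classSize(className) += funcCode.length
        // Functions split into a nested class must be invoked through its
        // instance, hence the class-qualified name in the return value.
        if (className == "OuterClass") funcName else s"$instance.$funcName"
      }

      def main(args: Array[String]): Unit = {
        val body = "void f() { /* ... */ }" * 3  // 66 chars, exceeds the toy threshold
        assert(addNewFunction("f1", body) == "f1")
        assert(addNewFunction("f2", body) == "nestedClassInstance1.f2")
      }
    }
    ```

    The returned name is what callers of `addNewFunction` must use at the call site, which is why the PR had to change every caller to use the returned string instead of the raw `funcName`.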




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #77597 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77597/testReport)** for PR 18075 at commit [`7fe5e4a`](https://github.com/apache/spark/commit/7fe5e4a84d4d8e71e2e63e6794e4ba13ac2e003f).




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    @cloud-fan do you have time to look at this?




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118274597
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -792,7 +887,18 @@ class CodegenContext {
           addMutableState(javaType(expr.dataType), value,
             s"$value = ${defaultValue(expr.dataType)};")
     
    -      subexprFunctions += s"$fnName($INPUT_ROW);"
    +      // Generate the code for this expression tree and wrap it in a function.
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/90a907acab6a074283908c0a3784af2c7d32cbce#diff-8bcc5aea39c73d4bf38aef6f6951d42cR872
    Note: a rebase with the two commits squashed will show the `subexprFunctions` as the only changed line, which is what we want




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118300406
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
       }
     
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline ot the OuterClass
    --- End diff --
    
    nit: ot -> at




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122359345
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    --- End diff --
    
    yup, whole stage codegen is really tricky...




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121838120
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and the
    +   * simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    --- End diff --
    
    can you give an example? I'm not very clear when we need this
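    One scenario that motivates `inlineToOuterClass` can be sketched as follows. The names `BufferedRowIterator` and `processNext` below are illustrative stand-ins for the whole-stage-codegen case, where the generated class must override a method of its superclass; such an override has to live in the outer class, so the splitter must never move it into a nested class:

    ```scala
    // Illustrative sketch, not actual Spark codegen output.
    object InlineToOuterClassExample {
      abstract class BufferedRowIterator {
        def processNext(): String
      }

      // The generated outer class must itself provide processNext(); if code
      // splitting moved the override into a nested helper class, the outer
      // class would no longer satisfy the abstract contract and compilation
      // would fail. inlineToOuterClass pins such functions to the outer class.
      class GeneratedIterator extends BufferedRowIterator {
        override def processNext(): String = "row"
      }

      def main(args: Array[String]): Unit = {
        val it: BufferedRowIterator = new GeneratedIterator
        assert(it.processNext() == "row")
      }
    }
    ```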




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r119179700
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    --- End diff --
    
    Fixed (with some refactoring based on those variables being available in the scope earlier): https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-8bcc5aea39c73d4bf38aef6f6951d42cL278




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    **[Test build #77564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77564/testReport)** for PR 18075 at commit [`493113c`](https://github.com/apache/spark/commit/493113ce2e1271039701be35b2603271282111df).




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118868339
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---
    @@ -190,15 +190,15 @@ case class Stack(children: Seq[Expression]) extends Generator {
             if (index < values.length) values(index) else Literal(null, dataTypes(col))
           }
           val eval = CreateStruct(fields).genCode(ctx)
    -      s"${eval.code}\nthis.$rowData[$row] = ${eval.value};"
    +      s"${eval.code}\n$rowData[$row] = ${eval.value};"
         })
     
         // Create the collection.
         val wrapperClass = classOf[mutable.WrappedArray[_]].getName
         ctx.addMutableState(
           s"$wrapperClass<InternalRow>",
           ev.value,
    -      s"this.${ev.value} = $wrapperClass$$.MODULE$$.make(this.$rowData);")
    +      s"this.${ev.value} = $wrapperClass$$.MODULE$$.make($rowData);")
    --- End diff --
    
    Should remove one more `this.` here?
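The `this.` removal being reviewed here follows from how Java resolves names inside nested classes, which can be sketched as follows (a minimal hypothetical example, not Spark's generated output):

```java
// Why generated code drops `this.` on outer-class state: inside a nested
// class, `this` refers to the nested instance, so `this.rowData` would not
// resolve there; the bare name falls through to the enclosing class's field.
public class ThisKeywordDemo {
  private int[] rowData = {1, 2, 3};

  private class NestedClass {
    int first() {
      // `this.rowData` would fail to compile here: `this` is the NestedClass
      // instance, which declares no such field. The unqualified name resolves
      // to the enclosing class's field instead.
      return rowData[0];
    }
  }

  public int firstViaNested() {
    return new NestedClass().first();
  }

  public static void main(String[] args) {
    System.out.println(new ThisKeywordDemo().firstViaNested()); // prints 1
  }
}
```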




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122122817
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -62,13 +62,63 @@ class GeneratedProjectionSuite extends SparkFunSuite {
         val result = safeProj(unsafe)
         // Can't compare GenericInternalRow with JoinedRow directly
         (0 until N).foreach { i =>
    -      val r = i + 1
    -      val s = UTF8String.fromString((i + 1).toString)
    -      assert(r === result.getInt(i + 2))
    +      val s = UTF8String.fromString(i.toString)
    +      assert(i === result.getInt(i + 2))
           assert(s === result.getUTF8String(i + 2 + N))
    -      assert(r === result.getStruct(0, N * 2).getInt(i))
    +      assert(i === result.getStruct(0, N * 2).getInt(i))
           assert(s === result.getStruct(0, N * 2).getUTF8String(i + N))
    -      assert(r === result.getStruct(1, N * 2).getInt(i))
    +      assert(i === result.getStruct(1, N * 2).getInt(i))
    +      assert(s === result.getStruct(1, N * 2).getUTF8String(i + N))
    +    }
    +
    +    // test generated MutableProjection
    +    val exprs = nestedSchema.fields.zipWithIndex.map { case (f, i) =>
    +      BoundReference(i, f.dataType, true)
    +    }
    +    val mutableProj = GenerateMutableProjection.generate(exprs)
    +    val row1 = mutableProj(result)
    +    assert(result === row1)
    +    val row2 = mutableProj(result)
    +    assert(result === row2)
    +  }
    +
    +  test("SPARK-18016: generated projections on wider table requiring class-splitting") {
    +    val N = 4000
    +    val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
    --- End diff --
    
    ditto




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118497931
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    Here's a commit with that change if you think it checks out: https://github.com/apache/spark/pull/18075/commits/c225f3ad3b5183be6c637633b0ebffc765be9532#diff-8bcc5aea39c73d4bf38aef6f6951d42cL290
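The splitting decision in the `addNewFunction` diff quoted above can be sketched in Java (a hypothetical simplification of the bookkeeping, with invented names; the real implementation also tracks per-class function bodies):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the size-threshold bookkeeping: once the current class's
// accumulated code size crosses the threshold, subsequent functions go into
// a fresh nested class and are returned under a class-qualified name.
public class ClassSplitter {
  static final int THRESHOLD = 1600000; // bytes of generated code per class

  private final Deque<String> classes = new ArrayDeque<>(List.of("OuterClass"));
  private final Map<String, Integer> classSize = new HashMap<>(Map.of("OuterClass", 0));
  private int nestedClassCount = 0;

  /** Returns the (possibly instance-qualified) name to call the function by. */
  public String addNewFunction(String funcName, String funcCode) {
    String current = classes.peek();
    if (classSize.get(current) > THRESHOLD) {
      current = "NestedClass" + (++nestedClassCount);
      classes.push(current);
      classSize.put(current, 0);
    }
    classSize.merge(current, funcCode.length(), Integer::sum);
    return current.equals("OuterClass")
        ? funcName
        : "nestedClassInstance" + nestedClassCount + "." + funcName;
  }

  public static void main(String[] args) {
    ClassSplitter cs = new ClassSplitter();
    System.out.println(cs.addNewFunction("f1", "int f1() { return 1; }"));
    System.out.println(cs.addNewFunction("f2", "x".repeat(1_700_000)));
    // The class size now exceeds the threshold, so the next function is
    // placed in a nested class and returned with a qualified name.
    System.out.println(cs.addNewFunction("f3", "int f3() { return 3; }"));
  }
}
```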




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118302075
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -299,6 +297,9 @@ case class SampleExec(
               | }
              """.stripMargin.trim)
     
    +      ctx.addMutableState(s"$samplerClass<UnsafeRow>", sampler,
    --- End diff --
    
    This change should stay the way it is. Notice that the initialization code for the `addMutableState` call on the line just below is the same code passed to the `addNewFunction` call. This means that when we create the new function, we may get back a class-qualified function name (which here we store in `initSamplerFuncName`), so the call to `addMutableState` must come after the `addNewFunction` call.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118165399
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -792,7 +887,18 @@ class CodegenContext {
           addMutableState(javaType(expr.dataType), value,
             s"$value = ${defaultValue(expr.dataType)};")
     
    -      subexprFunctions += s"$fnName($INPUT_ROW);"
    +      // Generate the code for this expression tree and wrap it in a function.
    --- End diff --
    
    Is there any reason to move this code block from the original place to here?




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121837609
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -93,7 +93,7 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
         }
     
         val nextBatch = ctx.freshName("nextBatch")
    --- End diff --
    
    nit: inline this `nextBatch`




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122016475
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/678b4ad770cbc891864f73d9b93c51b1ab79f6a5#diff-8bcc5aea39c73d4bf38aef6f6951d42cR225




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118323003
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map(("OuterClass", mutable.ListBuffer.empty[String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcCode)
    --- End diff --
    
    `+=` is necessary since `classFunctions(name)` returns the `ListBuffer[String]` containing all functions belonging to the given class name, and we want to append the new `funcCode` to that buffer. Also, `classFunctions.update` expects a `ListBuffer[String]` as its second argument; the return type of an assignment `=` is just `Unit`, whereas `+=` appends the element and then returns the modified buffer (which is the behavior we want). See the API doc for [ListBuffer](https://www.scala-lang.org/api/2.11.8/index.html#scala.collection.mutable.ListBuffer@+=(x:A):ListBuffer.this.type).
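The in-place append can be illustrated with a Java analogue of the same map-of-buffers pattern (assumed equivalent semantics, with `ArrayList` standing in for `ListBuffer`): fetching the list for a key and appending mutates the very list the map already holds, so no separate `put` back into the map is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Java analogue of `classFunctions(name) += funcCode`: the list retrieved
// from the map is the same object the map holds, so appending to it updates
// the map's view as well.
public class MapOfLists {
  public static void main(String[] args) {
    Map<String, List<String>> classFunctions = new HashMap<>();
    classFunctions.put("OuterClass", new ArrayList<>());

    // Append in place; no re-insertion into the map is required.
    classFunctions.get("OuterClass").add("void f() {}");
    classFunctions.get("OuterClass").add("void g() {}");

    System.out.println(classFunctions.get("OuterClass").size()); // prints 2
  }
}
```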




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121968338
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    --- End diff --
    
    I had originally thought so, but it turns out that there's at least one instance where the code for a given function name is updated during the code-generation process. The generated [`stopEarly`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L75) function can actually be inserted twice, each time referencing a different `stopEarly` variable. What would end up occurring is that two functions with the same signature would exist in the class, causing a compile error. So we need to use a map to make sure the implementation gets _updated_ for a given function when necessary. Note also that the old implementation of [`addedFunctions`](https://github.com/apache/spark/pull/18075/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cL208) was also a map.
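The dedup-by-name behavior can be sketched in Java (hypothetical function bodies; a `LinkedHashMap` standing in for the Scala map): keying functions by name means a second insertion of `stopEarly` replaces the earlier definition instead of producing two methods with the same signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Keying generated functions by name: re-inserting `stopEarly` updates the
// single map entry, whereas a plain list would accumulate two methods with
// the same signature and the generated class would fail to compile.
public class FunctionDedup {
  public static void main(String[] args) {
    Map<String, String> functions = new LinkedHashMap<>();
    functions.put("stopEarly", "boolean stopEarly() { return stopEarly_1; }");
    // A later pass regenerates the function body; the entry is updated.
    functions.put("stopEarly", "boolean stopEarly() { return stopEarly_2; }");

    System.out.println(functions.size()); // prints 1
  }
}
```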




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118866096
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined into the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    --- End diff --
    
    nit: We can use `val (className, classInstance) = if ...` here.
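The bookkeeping quoted in the diff above can be modeled as a small standalone sketch (it is not the actual Spark `CodegenContext`: `freshName` is a toy counter here, and the 1600k byte threshold is made a constructor parameter so the split is easy to exercise; the nit's `val (className, classInstance) = if ...` style is applied):

```scala
import scala.collection.mutable

// A minimal, standalone model of the class-splitting bookkeeping. A sketch,
// not Spark's CodegenContext: freshName is a toy counter, and the threshold
// is configurable so a split can be triggered with tiny inputs.
class SplittingContext(threshold: Int = 1600000) {
  // (className, instanceName) pairs; the head is the most recently added class.
  private val classes = mutable.ListBuffer[(String, String)]("OuterClass" -> null)
  private val classSize = mutable.Map[String, Int]("OuterClass" -> 0)
  private val classFunctions =
    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])

  private var counter = 0
  private def freshName(prefix: String): String = { counter += 1; s"$prefix$counter" }

  private def currClassSize(): Int = classSize(classes.head._1)
  private def currClass(): (String, String) = classes.head

  private def addClass(className: String, classInstance: String): Unit = {
    classes.prepend(className -> classInstance)
    classSize += className -> 0
    classFunctions += className -> mutable.Map.empty[String, String]
  }

  // Routes a function either to the outer class (returning the bare name) or,
  // once the current class has grown past the threshold, to a fresh nested
  // class (returning an instance-qualified name).
  def addNewFunction(
      funcName: String,
      funcCode: String,
      inlineToOuterClass: Boolean = false): String = {
    val (className, classInstance) = if (inlineToOuterClass) {
      "OuterClass" -> ""
    } else if (currClassSize() > threshold) {
      val name = freshName("NestedClass")
      val instance = freshName("nestedClassInstance")
      addClass(name, instance)
      name -> instance
    } else {
      currClass()
    }
    classSize(className) += funcCode.length
    classFunctions(className) += funcName -> funcCode
    if (className == "OuterClass") funcName else s"$classInstance.$funcName"
  }
}
```

Because the returned name may be instance-qualified, callers must always invoke the function by the returned name rather than the `funcName` they passed in, which is change 2 in the PR description.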


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r119179925
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions(name).append(funcCode)
    --- End diff --
    
    I suppose that's more concise/readable given the underlying mutable data structures: https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-8bcc5aea39c73d4bf38aef6f6951d42cL292




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r119179501
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-8bcc5aea39c73d4bf38aef6f6951d42cL250




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122017129
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -83,6 +83,58 @@ class GeneratedProjectionSuite extends SparkFunSuite {
         assert(result === row2)
       }
     
    +  test("SPARK-18016: generated projections on wider table requiring class-splitting") {
    +    val N = 4000
    +    val wideRow1 = new GenericInternalRow((1 to N).toArray[Any])
    +    val schema1 = StructType((1 to N).map(i => StructField("", IntegerType)))
    +    val wideRow2 = new GenericInternalRow(
    +      (1 to N).map(i => UTF8String.fromString(i.toString)).toArray[Any])
    +    val schema2 = StructType((1 to N).map(i => StructField("", StringType)))
    +    val joined = new JoinedRow(wideRow1, wideRow2)
    +    val joinedSchema = StructType(schema1 ++ schema2)
    +    val nested = new JoinedRow(InternalRow(joined, joined), joined)
    +    val nestedSchema = StructType(
    +      Seq(StructField("", joinedSchema), StructField("", joinedSchema)) ++ joinedSchema)
    +
    +    // test generated UnsafeProjection
    +    val unsafeProj = UnsafeProjection.create(nestedSchema)
    +    val unsafe: UnsafeRow = unsafeProj(nested)
    +    (0 until N).foreach { i =>
    +      val s = UTF8String.fromString((i + 1).toString)
    --- End diff --
    
    Fixed. See the above comment. Creating the data with `0 until N` cleans up the indexing on `i`.
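The indexing point in the comment above can be shown directly: data built from `1 to N` puts `i + 1` at index `i`, forcing an offset in every check, while building it from `0 until N` makes value and index agree (a standalone illustration, not the suite's actual code):

```scala
val N = 4000

// Data built from (1 to N): the element at index i is i + 1, so every
// verification loop over (0 until N) needs the i + 1 offset.
val offsetRow = (1 to N).toArray[Any]
(0 until N).foreach { i => assert(offsetRow(i) == i + 1) }

// Data built from (0 until N): value equals index, no offset needed.
val alignedRow = (0 until N).toArray[Any]
(0 until N).foreach { i => assert(alignedRow(i) == i) }
```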




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118162963
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -629,7 +736,9 @@ class CodegenContext {
     
       /**
        * Splits the generated code of expressions into multiple functions, because function has
    -   * 64kb code size limit in JVM
    +   * 64kb code size limit in JVM. If the class the function is to be inlined to would grow beyond
    +   * 1600kb, a private, netsted sub-class is declared, and the function is inlined to it, because
    --- End diff --
    
    nit: netsted -> nested?




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    LGTM except some style comments




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122356214
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    --- End diff --
    
    It seems to me, as the `stopEarly` in `Limit` is going to override the `stopEarly` in `BufferedRowIterator`, we can only put it in outer class.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r119179568
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-8bcc5aea39c73d4bf38aef6f6951d42cL271




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118274179
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +223,129 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](("OuterClass", null))
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](("OuterClass", 0))
    +
    +  // A map holding all functions and their names that will be inlined to a given class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(("OuterClass", mutable.Map.empty[String, String]))
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val classInfo = if (inlineToOuterClass) {
    +      ("OuterClass", "")
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass
    +    }
    +    val name = classInfo._1
    +
    +    classSize.update(name, classSize(name) + funcCode.length)
    +    classFunctions.update(name, classFunctions(name) += funcName -> funcCode)
    +
    +    if (name.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      val classInstance = classInfo._2
    +
    +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline to the OuterClass.
    +    classes.map {
    +      case (className, classInstance) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          s"private $className $classInstance = new $className();"
    +        }
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all functions that should be inlined to the `OuterClass`
    +   */
    +  private[sql] def declareAddedFunctions(): String = {
    +    classFunctions("OuterClass").map {
    +      case (funcName, funcCode) => funcCode
    +    }.mkString("\n")
    +  }
    +
    +  /**
    +   * Declares all nested, private sub-classes and functions that should be inlined to them.
    +   */
    +  private[sql] def declareNestedClasses(): String = {
    +    classFunctions.map {
    +      case (className, functions) =>
    +        if (className.equals("OuterClass")) {
    +          ""
    +        } else {
    +          val code = functions.map {
    +            case (_, funcCode) =>
    +              s"$funcCode"
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/90a907acab6a074283908c0a3784af2c7d32cbce#diff-8bcc5aea39c73d4bf38aef6f6951d42cL336
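The three assembly helpers quoted in the diff above can be illustrated with a small standalone sketch over canned data (hypothetical class, instance, and function names; not Spark's actual generated output):

```scala
// Canned bookkeeping state: one nested class was split off the outer class.
val classes = List("NestedClass1" -> "nestedClassInstance1")
val classFunctions = Map(
  "OuterClass" -> Seq("public int f1() { return 1; }"),
  "NestedClass1" -> Seq("public int f2() { return 2; }"))

// Instances of every nested class become fields of the outer class, so the
// instance-qualified names returned by addNewFunction resolve.
def initNestedClasses(): String =
  classes.map { case (cls, inst) => s"private $cls $inst = new $cls();" }
    .mkString("\n")

// Only functions routed to the outer class are declared inline there.
def declareAddedFunctions(): String =
  classFunctions("OuterClass").mkString("\n")

// Each nested class wraps the functions that were routed to it.
def declareNestedClasses(): String =
  classFunctions.collect { case (cls, funcs) if cls != "OuterClass" =>
    s"private class $cls {\n${funcs.mkString("\n")}\n}"
  }.mkString("\n")
```

The nested classes carry no mutable state of their own; they only reference the outer class's state, which is why a simple field initialization in the outer class suffices.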




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121836835
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map("OuterClass" -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
       }
     
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined to the `OuterClass` and
    +   * the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean = false): String = {
    +    // The number of named constants that can exist in the class is limited by the Constant Pool
    +    // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    +    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // sub-class.
    +    val (className, classInstance) = if (inlineToOuterClass) {
    +      "OuterClass" -> ""
    +    } else if (currClassSize > 1600000) {
    +      val className = freshName("NestedClass")
    +      val classInstance = freshName("nestedClassInstance")
    +
    +      addClass(className, classInstance)
    +
    +      className -> classInstance
    +    } else {
    +      currClass()
    +    }
    +
    +    classSize(className) += funcCode.length
    +    classFunctions(className) += funcName -> funcCode
    +
    +    if (className.equals("OuterClass")) {
    +      funcName
    +    } else {
    +      s"$classInstance.$funcName"
    +    }
    +  }
    +
    +  /**
    +   * Instantiates all nested, private sub-classes as objects to the `OuterClass`
    +   */
    +  private[sql] def initNestedClasses(): String = {
    +    // Nested, private sub-classes have no mutable state (though they do reference the outer class'
    +    // mutable state), so we declare and initialize them inline to the OuterClass.
    +    classes.map {
    --- End diff --
    
    nit: `classes.filterKeys(_ != "OuterClass").map ...`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

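[Editor's note] For readers skimming the thread, the splitting scheme in the quoted diff above can be sketched as a small, self-contained snippet. The `OuterClass` placeholder and the 1600000-byte threshold follow the quoted code; the simplified `freshName` counter and the size accounting are illustrative assumptions, not the actual Catalyst implementation:

```scala
import scala.collection.mutable

object SplitSketch {
  // Mirrors the quoted diff: the head of `classes` is the most recently added
  // class; "OuterClass" stands for whichever outermost class is generated.
  private val threshold = 1600000
  private val classes = mutable.ListBuffer[(String, String)]("OuterClass" -> null)
  private val classSize = mutable.Map[String, Int]("OuterClass" -> 0)
  private var counter = 0

  // Simplified stand-in for CodegenContext.freshName.
  private def freshName(prefix: String): String = { counter += 1; s"$prefix$counter" }

  // Returns the simple name while the current class is small enough, and an
  // instance-qualified name once the function lands in a nested sub-class.
  def addNewFunction(funcName: String, funcCode: String): String = {
    val (className, classInstance) =
      if (classSize(classes.head._1) > threshold) {
        val cls = freshName("NestedClass")
        val inst = freshName("nestedClassInstance")
        classes.prepend(cls -> inst)
        classSize += cls -> 0
        cls -> inst
      } else {
        classes.head
      }
    classSize(className) += funcCode.length
    if (className == "OuterClass") funcName else s"$classInstance.$funcName"
  }
}
```

The point is the return value: callers must use whatever name comes back, which is why change 2 in the PR description threads the returned name through every `addNewFunction` call site.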

[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    @ueshin As for the remaining `this` in `objects.scala`, https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-e436c96ea839dfe446837ab2a3531f93L984

    and the need for an additional nested classes declaration in `GenerateColumnAccessor.scala`, https://github.com/apache/spark/pull/18075/commits/493113ce2e1271039701be35b2603271282111df#diff-58a69e526de8182bcb4c840a8cb29e2dR225




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121836403
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,124 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    --- End diff --
    
    seems we only need a map from class name to function codes? i.e. `mutable.Map[String, mutable.ListBuffer[String]]`

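[Editor's note] A hypothetical sketch of the shape the reviewer suggests, with made-up function strings; since each code string already embeds its own function name, a per-class list of bodies is enough, and the name-to-code map adds nothing:

```scala
import scala.collection.mutable

// Class name -> function bodies, as suggested in the comment above.
val classFunctions = mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
classFunctions("OuterClass") += "private int inc(int x) { return x + 1; }"
classFunctions("OuterClass") += "private int dec(int x) { return x - 1; }"

// Declaring a class's functions is then just a concatenation.
val declared = classFunctions("OuterClass").mkString("\n")
```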



[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122122867
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -93,7 +93,7 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
         }
     
         val nextBatch = ctx.freshName("nextBatch")
    --- End diff --
    
    let's keep it as it was




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    LGTM, cc @cloud-fan.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r121836537
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -629,7 +730,9 @@ class CodegenContext {
     
       /**
        * Splits the generated code of expressions into multiple functions, because function has
    -   * 64kb code size limit in JVM
    +   * 64kb code size limit in JVM. If the class the function is to be inlined to would grow beyond
    --- End diff --
    
    `If the class the function` -> `If the class with the function`?




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122122765
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala ---
    @@ -33,10 +33,10 @@ class GeneratedProjectionSuite extends SparkFunSuite {
     
       test("generated projections on wider table") {
         val N = 1000
    -    val wideRow1 = new GenericInternalRow((1 to N).toArray[Any])
    +    val wideRow1 = new GenericInternalRow((0 until N).toArray[Any])
    --- End diff --
    
    nit: can be `new GenericInternalRow(N)`




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118865349
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,128 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  /**
    +   * Holds the class and instance names to be generated. `OuterClass` is a placeholder standing for
    +   * whichever class is generated as the outermost class and which will contain any nested
    +   * sub-classes. All other classes and instance names in this list will represent private, nested
    +   * sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)]("OuterClass" -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int]("OuterClass" -> 0)
    +
    +  // A map holding lists of functions belonging to their class.
    +  private val classFunctions: mutable.Map[String, mutable.ListBuffer[String]] =
    +    mutable.Map("OuterClass" -> mutable.ListBuffer.empty[String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(Tuple2(className, classInstance))
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.ListBuffer.empty[String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    +   * function will be returned. Otherwise, the function will be inlined into the `OuterClass`
    +   * and the simple `funcName` will be returned.
    +   *
    +   * @param funcName the class-unqualified name of the function
    +   * @param funcCode the body of the function
    +   * @param inlineToOuterClass whether the given code must be inlined to the `OuterClass`. This
    +   *                           can be necessary when a function is declared outside of the context
    +   *                           it is eventually referenced and a returned qualified function name
    +   *                           cannot otherwise be accessed.
    +   * @return the name of the function, qualified by class if it will be inlined to a private,
    +   *         nested sub-class
    +   */
    +  def addNewFunction(
    +    funcName: String,
    +    funcCode: String,
    +    inlineToOuterClass: Boolean = false): String = {
    --- End diff --
    
    nit: indent. Add 2 more spaces.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118274669
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -629,7 +736,9 @@ class CodegenContext {
     
       /**
        * Splits the generated code of expressions into multiple functions, because function has
    -   * 64kb code size limit in JVM
    +   * 64kb code size limit in JVM. If the class the function is to be inlined to would grow beyond
    +   * 1600kb, a private, netsted sub-class is declared, and the function is inlined to it, because
    --- End diff --
    
    Fixed: https://github.com/apache/spark/pull/18075/commits/90a907acab6a074283908c0a3784af2c7d32cbce#diff-8bcc5aea39c73d4bf38aef6f6951d42cL740




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    ok to test




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122122445
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,118 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  val outerClassName = "OuterClass"
    --- End diff --
    
    nit: `private val`




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    reviewing




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r118309307
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -299,6 +297,9 @@ case class SampleExec(
               | }
              """.stripMargin.trim)
     
    +      ctx.addMutableState(s"$samplerClass<UnsafeRow>", sampler,
    --- End diff --
    
    Ah, got it. I overlooked the new dependency regarding `initSamplerFuncName`.




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78059/
    Test PASSed.




[GitHub] spark issue #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18075
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18075: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18075#discussion_r122356960
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -233,10 +222,118 @@ class CodegenContext {
       // The collection of sub-expression result resetting methods that need to be called on each row.
       val subexprFunctions = mutable.ArrayBuffer.empty[String]
     
    -  def declareAddedFunctions(): String = {
    -    addedFunctions.map { case (funcName, funcCode) => funcCode }.mkString("\n")
    +  val outerClassName = "OuterClass"
    +
    +  /**
    +   * Holds the class and instance names to be generated, where `OuterClass` is a placeholder
    +   * standing for whichever class is generated as the outermost class and which will contain any
    +   * nested sub-classes. All other classes and instance names in this list will represent private,
    +   * nested sub-classes.
    +   */
    +  private val classes: mutable.ListBuffer[(String, String)] =
    +    mutable.ListBuffer[(String, String)](outerClassName -> null)
    +
    +  // A map holding the current size in bytes of each class to be generated.
    +  private val classSize: mutable.Map[String, Int] =
    +    mutable.Map[String, Int](outerClassName -> 0)
    +
    +  // Nested maps holding function names and their code belonging to each class.
    +  private val classFunctions: mutable.Map[String, mutable.Map[String, String]] =
    +    mutable.Map(outerClassName -> mutable.Map.empty[String, String])
    +
    +  // Returns the size of the most recently added class.
    +  private def currClassSize(): Int = classSize(classes.head._1)
    +
    +  // Returns the class name and instance name for the most recently added class.
    +  private def currClass(): (String, String) = classes.head
    +
    +  // Adds a new class. Requires the class' name, and its instance name.
    +  private def addClass(className: String, classInstance: String): Unit = {
    +    classes.prepend(className -> classInstance)
    +    classSize += className -> 0
    +    classFunctions += className -> mutable.Map.empty[String, String]
    +  }
    +
    +  /**
    +   * Adds a function to the generated class. If the code for the `OuterClass` grows too large, the
    +   * function will be inlined into a new private, nested class, and a class-qualified name for the
    --- End diff --
    
    nit: class instance-qualified name

