Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2017/10/12 08:21:40 UTC

[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19480

    [SPARK-22226] splitExpression can create too many method calls in the outer class

    ## What changes were proposed in this pull request?
    
    SPARK-18016 introduced {{NestedClass}} to prevent the many methods generated by {{splitExpressions}} from contributing to the outer class' constant pool and making it grow too much. Unfortunately, even though their definitions are stored in the {{NestedClass}}, they are all invoked from the outer class, and each method invocation adds two entries to its constant pool: a {{Methodref}} and a {{Utf8}} entry (you can easily check this by compiling a simple sample class with {{janinoc}} and looking at its constant pool; a sketch of such a check follows the example below). This limits how well the solution scales when a very large method is split into a lot of small ones. It means that currently we are generating classes like this one:
    
    ```
    class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
    ...
      public UnsafeRow apply(InternalRow i) {
         rowWriter.zeroOutNullBytes();
         apply_0(i);
         apply_1(i);
    ...
        nestedClassInstance.apply_862(i);
        nestedClassInstance.apply_863(i);
    ...
      }
    ...
      private class NestedClass {
        private void apply_862(InternalRow i) { ... }
        private void apply_863(InternalRow i) { ... }
    ...
      }
    }
    ```
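    
    As a side note, the effect is easy to reproduce outside of Spark. The following is a purely illustrative sketch (not part of this PR; class and method names are made up): compile it and dump the outer class with `javap -v`, and the constant pool shows a {{Methodref}} (plus the backing {{Utf8}} entries for the method name and descriptor) for every distinct nested-class method invoked from `apply`.
    
    ```
    // Illustrative only: the same shape as the generated code above.
    // After compiling, `javap -v Sample` lists a Methodref and the corresponding
    // Utf8 entries for each distinct nested-class method called below.
    class Sample {
      private val nested = new Nested
    
      def apply(i: Int): Unit = {
        nested.apply_0(i)
        nested.apply_1(i)
        // with thousands of such distinct calls, Sample's constant pool grows
        // by at least one Methodref and one Utf8 entry per call target
      }
    
      private class Nested {
        def apply_0(i: Int): Unit = {}
        def apply_1(i: Int): Unit = {}
      }
    }
    ```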
    
    This PR reduces the constant pool size of the outer class by adding a new method to each nested class: this method invokes all the small methods that {{splitExpression}} generated in that nested class. In this way, the outer class contains only one method invocation per nested class, reducing the number of constant pool entries it needs for method invocations by orders of magnitude. After the patch, the generated code becomes:
    
    ```
    class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
    ...
      public UnsafeRow apply(InternalRow i) {
         rowWriter.zeroOutNullBytes();
         apply_0(i);
         apply_1(i);
         ...
         nestedClassInstance.apply(i);
         nestedClassInstance1.apply(i);
         ...
      }
    ...
      private class NestedClass {
        private void apply_862(InternalRow i) { ... }
        private void apply_863(InternalRow i) { ... }
    ...
        private void apply(InternalRow i) {
          apply_862(i);
          apply_863(i);
          ...
        }
      }
    }
    ```
    
    ## How was this patch tested?
    
    Added new UTs; the change is also covered by existing UTs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22226

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19480.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19480
    
----
commit c2cc295fd85a7a0e42debc954311ff74f5b52962
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-10-06T15:33:08Z

    add a method for each inner class and use it in the superclass

commit d3a5b872e5446e1205a91498d977af6e6259e58b
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-10-09T14:21:45Z

    Adding UT and modifying class size limit

----


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144246243
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    we might, but I don't think that would be the best option. With the current approach we are "reusing" some constant pool entries (like the `Utf8` entry for the function name) and everything related to those methods stays in the same class.
    Moreover, for each instance referenced from other classes, as would happen in that case, we add new entries to the constant pool of the referencing class.
    Thus this is the option which minimizes the number of entries added to the constant pools.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83047/
    Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146864067
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    IIUC, 4-space indentation is used when it is followed by 2-space indentation. Here is [an example](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L49).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83047/testReport)** for PR 19480 at commit [`0215139`](https://github.com/apache/spark/commit/02151392fedca69c784c039a2e39c648dfd1b894).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144287183
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    --- End diff --
    
    Yeah, I think it can only happen for the functions of the first or last sub-class. It seems the functions can be spread over at most two sub-classes, and most of them will be wrapped in one function call.
    
    I'm not sure what the proper threshold for this is, maybe 5?


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83079 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83079/testReport)** for PR 19480 at commit [`be84c4d`](https://github.com/apache/spark/commit/be84c4dc3fe40a939e9722f4ddb94e309aa7c841).


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r145263245
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -1010,6 +1079,10 @@ object CodeGenerator extends Logging {
       // This is the value of HugeMethodLimit in the OpenJDK JVM settings
       val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000
     
    +  // This is the threshold over which the methods in a inner class are grouped in a single
    --- End diff --
    
    nit. `in a inner class` -> `in an inner class`


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144281860
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +78,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the superclass, otherwise it contains the name of the
    --- End diff --
    
    superclass here too.


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    nit: Do we need `nestedClassInstance1.apply(i);` in the description?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147060120
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,22 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(
    --- End diff --
    
    Rename it to `NewFunctionSpec`?


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83127/testReport)** for PR 19480 at commit [`36d8e2e`](https://github.com/apache/spark/commit/36d8e2e1596266f6f7552e2b21105d167e55efc1).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82734/
    Test FAILed.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144236254
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +291,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
    --- End diff --
    
    Any reason to change this?


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242361
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    --- End diff --
    
    we do need it for methods like `compare`, where we have to return something.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147061443
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +832,49 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    --- End diff --
    
    move this just one line before line 876?


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144287312
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Sounds reasonable to me.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82758/
    Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @bdrillard thanks but we removed the "guilty" test case for many reasons. Thank you anyway.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    sorry, the test failure is due to an OutOfMemory exception. I don't know whether it is possible to change the heap size used by sbt in the Jenkins job, and I am not sure that would be the right thing to do anyway.
    
    The problem is caused by the UT which reproduces the error in an end-to-end example. Nonetheless, the first UT I added already tests that the patch works. IMHO we might get rid of this new UT and rely on the original one in order not to waste resources.
    Let me know what you think about this and what I should do.
    Thanks.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144541081
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    @viirya I have good news and bad news... Thanks to your suggestion I have been able to understand and reproduce the issue. Moreover, I also found another issue which is fixed by this PR, and I am adding a UT for it too: in some cases, we might get a
    ```
    Code of method apply(...) grows beyond 64 KB
    ```
    With this PR that problem is fixed.
    
    The bad news is that the UT you provided still fails, but with a different error: it is still a Constant Pool limit exceeded exception, but now in a NestedClass. From my analysis, this is caused by another problem, i.e. that we might reference too many fields of the outer class in the NestedClasses. This might be addressed by tuning the magic number which I lowered to 1000k in this PR, but I am pretty sure that it will also be addressed by the ongoing PR for SPARK-18016, since it tries to reduce the number of variables. Thus I consider this out of scope for this PR.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @gatorsmile sorry, I wasn't aware. Thank you anyway.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82858/testReport)** for PR 19480 at commit [`95b0ad8`](https://github.com/apache/spark/commit/95b0ad82a68828055c45dbd729c753fae2007a5c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144282402
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    --- End diff --
    
    If `subclassFunctions.length` is only 1, I think we don't need to add a redundant wrapper caller?
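    
    For illustration, here is a minimal sketch of what skipping the wrapper for small groups could look like (the names, the callback and the threshold are hypothetical, not the PR's actual code):
    
    ```
    // Purely illustrative Scala sketch: build the outer-class call sites for one
    // inner class. `addWrapperToClass` stands in for whatever registers the
    // wrapper method in the inner class and returns its name; it is not a real
    // CodegenContext API.
    object InnerClassCallsSketch {
      def callsForInnerClass(
          instanceName: String,
          functionNames: Seq[String],
          argInvocationString: String,
          mergeThreshold: Int)(addWrapperToClass: Seq[String] => String): Seq[String] = {
        val directCalls = functionNames.map(name => s"$name($argInvocationString)")
        if (functionNames.size > mergeThreshold) {
          // many functions: hide them behind one wrapper, so the outer class adds
          // a single call site instead of one per split function
          val wrapperName = addWrapperToClass(directCalls)
          Seq(s"$instanceName.$wrapperName($argInvocationString)")
        } else {
          // one (or very few) functions: the indirection is not worth an extra method
          directCalls.map(call => s"$instanceName.$call")
        }
      }
    }
    ```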


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144452862
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -1010,6 +1079,8 @@ object CodeGenerator extends Logging {
       // This is the value of HugeMethodLimit in the OpenJDK JVM settings
       val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000
     
    +  val MERGE_SPLIT_METHODS_THRESHOLD = 3
    --- End diff --
    
    Another magic number?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144688074
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    yes, this is the reason why I created [this PR](https://github.com/apache/spark/pull/19447): this is an estimate, which might be good 99% of the time, but for the other 1% of use cases it would be good to be able to tune it to prevent this error IMHO.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82735/testReport)** for PR 19480 at commit [`61cc445`](https://github.com/apache/spark/commit/61cc4458613722bba99fd92e9b30c1d797916846).


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144263743
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    yes, but then we would likely recreate, in the last `NestedClass`, the very issue this PR is trying to solve. The problem is that a method call, even though it is only a few bytes of code, introduces multiple entries in the constant pool. If we use `addNewFunction` here, we will end up creating all the wrapper methods in the last `NestedClass`, which will then have the same issues as the outer class and become a new bottleneck.
    Instead, in this way we split the method calls among the several nested classes, and each of them has its own wrapper, which also reuses some constant pool entries.
    As you pointed out, nested classes are less critical than the outer one, because the attributes are declared in the outer class and currently all the many small methods are called there. If we put `addNewFunction` here, instead, we would make the last nested class the critical one, because we would be adding all the many small function calls to it.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144559514
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    @mgaido91 Thanks for trying it. Yeah, those expressions like `skewness` are very complicated, so they're likely to cause the issue you encountered.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83084/testReport)** for PR 19480 at commit [`a39098d`](https://github.com/apache/spark/commit/a39098dc0319cbd207e132aed7214f1c02245ec6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144451252
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +  subclassInstance: Option[String])
    --- End diff --
    
    Indent issues.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    The current change LGTM.
    
    It might be better if you could add an end-to-end test for this, since it doesn't seem to be an issue end users can easily run into. Can you try to add one? @mgaido91



---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144564741
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    @gatorsmile it's a byte threshold, similar to the 1024 byte threshold set in `splitExpressions`. We can't know exactly how much code will contribute to the constant pool; that is, there's no easy static analysis we can perform on a block of code to say "this code will contribute `n` entries to the constant pool". We only know that the size of the code is strongly correlated with the number of entries in the constant pool. We're trying to keep the number of generated classes as low as possible while also grouping enough of the code to avoid the constant pool error.
    
    In short, I tested different types of schemas with many columns to find empirically what the value could be set to.
    
    There's no particular harm in setting the value lower, as is done here, if it helps us avoid a known constant pool error case. Doing so effectively reduces the number of expressions each nested class holds, and so also increases the number of nested classes in total.
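    
    To make the heuristic concrete, here is a simplified sketch (not the actual `CodegenContext` code; the names and bookkeeping are invented) of the size-based rollover being discussed: once the estimated byte size of the current class crosses the threshold, later functions are placed into a freshly created private nested class.
    
    ```
    // Simplified, illustrative Scala sketch of a size-based class rollover.
    class RolloverSketch(classSizeThreshold: Int = 1000000) {
      private var currClassName = "OuterClass"
      private var currClassSize = 0
      private var nestedClassCount = 0
    
      /** Returns the name of the class that should host a function of this size. */
      def classForNewFunction(funcCode: String): String = {
        if (currClassSize > classSizeThreshold) {
          // the current class is considered "full": start a new private nested class
          nestedClassCount += 1
          currClassName = s"NestedClass$nestedClassCount"
          currClassSize = 0
        }
        currClassSize += funcCode.length
        currClassName
      }
    }
    ```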


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242061
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +291,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
    --- End diff --
    
    Is it for the method that may be added to group all the split methods? If yes, we can add that method into an inner class (see https://github.com/apache/spark/pull/19480#discussion_r144240879), so we don't need to change this threshold.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144258373
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    It is also convincing that the nested classes are less critical. So maybe this is not a serious issue.


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @HyukjinKwon or @cloud-fan could you help trigger the Jenkins test? Thank you.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83047/testReport)** for PR 19480 at commit [`0215139`](https://github.com/apache/spark/commit/02151392fedca69c784c039a2e39c648dfd1b894).


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147452803
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,91 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
    +
    +      val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
    +        innerClassFunctions,
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesFunctionCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.LinkedHashMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      val value = f.functionName +: innerClassToFunctions.getOrElse(key, Seq.empty[String])
    +      innerClassToFunctions.update(key, value)
    +    })
    +
    +    val argDefinitionString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ")
    +    val argInvocationString = arguments.map(_._2).mkString(", ")
    +
    +    innerClassToFunctions.flatMap {
    +      case ((innerClassName, innerClassInstance), innerClassFunctions) =>
    +        // for performance reasons, the functions are prepended, instead of appended,
    +        // thus here they are in reversed order
    +        val orderedFunctions = innerClassFunctions.reverse
    +        if (orderedFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +          // Adding a new function to each inner class which contains the invocation of all the
    +          // ones which have been added to that inner class. For example,
    +          //   private class NestedClass {
    +          //     private void apply_862(InternalRow i) { ... }
    +          //     private void apply_863(InternalRow i) { ... }
    +          //       ...
    +          //     private void apply(InternalRow i) {
    +          //       apply_862(i);
    +          //       apply_863(i);
    +          //       ...
    +          //     }
    +          //   }
    +          val body = foldFunctions(orderedFunctions.map(name =>
    +            s"$name($argInvocationString)"))
    --- End diff --
    
    one line or you need to use `{}`


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @mgaido91 https://github.com/apache/spark/pull/18377 was reverted. I believe we are unable to merge this to 2.2. Sorry for that.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144295623
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    --- End diff --
    
    Right, a plain `Map` keeps the insertion order only up to 4 keys; thanks for the nice catch, I will fix this.
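
    For the record, a small sketch of the standard-library behaviour in question (not code from the patch):
    ```scala
    import scala.collection.mutable

    // immutable.Map keeps insertion order only while it is backed by the
    // specialized Map1..Map4 classes (up to 4 entries); beyond that it becomes
    // a hash map, so the order in which the grouped functions are emitted
    // would no longer be deterministic.
    val small = Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4)
    val big = small + ("e" -> 5) + ("f" -> 6) + ("g" -> 7)
    small.keys.mkString(", ")  // "a, b, c, d" -- insertion order preserved
    big.keys.mkString(", ")    // hash-based order, not necessarily insertion order

    // mutable.LinkedHashMap (what the patch ends up using) keeps insertion
    // order regardless of the number of entries.
    val ordered = mutable.LinkedHashMap.empty[String, Int]
    Seq("a", "b", "c", "d", "e", "f", "g").zipWithIndex.foreach { case (k, i) => ordered(k) = i }
    ordered.keys.mkString(", ")  // "a, b, c, d, e, f, g"
    ```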


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @kiszk I added it just to show that there is one method per instance, but I am missing the NestedClass1 declaration. I am adding it. Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146860250
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    Thanks for the link @kiszk, but it says that a 4 space indent is required for parameters: look at the example with `class Foo`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144453024
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +831,46 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to grow past the
    +      // JVM limit. To avoid this problem, we group them and call only the grouping methods
    +      // in the outer class.
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +            // Adding a new function to each subclass which contains
    +            // the invocation of all the ones which have been added to
    +            // that subclass
    +            val code = s"""
    +                |private $returnType $func($argString) {
    +                |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                      s"$name(${arguments.map(_._2).mkString(", ")})")))}
    --- End diff --
    
    missing `|`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82682/testReport)** for PR 19480 at commit [`bdc2fdb`](https://github.com/apache/spark/commit/bdc2fdb18688b86a1b30d6ea8a692a3e95858049).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144688322
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    You are right @viirya. Sorry, I didn't notice. Yes, the problem is that most of the time we hit both these issues at once, thus solving only one of them is not enough. It turns out that there are some corner cases in which this fix alone is enough, like the real case I am working on, but they are not easy to reproduce in a simple way. In that use case there are a lot of complex projections, a `dropDuplicates` and some joins after that, and the queries are made of thousands of lines of SQL code.
    The only way I have been able to reproduce it easily is in this test case: https://github.com/apache/spark/pull/19480/files#r144302922.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144240084
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    The class we add the code into can go over the threshold too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144240028
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    --- End diff --
    
    I think we don't need `makeSplitFunction` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83119/testReport)** for PR 19480 at commit [`6e8cd00`](https://github.com/apache/spark/commit/6e8cd00e6eab9fb64faae65214547457a86bbfcb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144581126
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    I've just tried this. Seems we can simplify it to:
    ```scala
      test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
        val colNumber = 6000
        val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
        val df = sqlContext.createDataFrame(input, StructType(
          (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
        df.dropDuplicates((1 to 5).map(colIndex => s"_$colIndex")).collect()
      }
    ```
    @mgaido91 Can you verify it? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146774741
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +834,46 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to grow past the
    +      // JVM limit. To avoid this problem, we group them and call only the grouping methods
    +      // in the outer class.
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +            // Adding a new function to each subclass which contains
    +            // the invocation of all the ones which have been added to
    +            // that subclass
    +            val code = s"""
    +                |private $returnType $func($argString) {
    +                |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                      s"$name(${arguments.map(_._2).mkString(", ")})")))}
    --- End diff --
    
    Instead of inlining a function, why not create a variable before `val code`?
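
    A rough sketch of the suggested shape, assuming the surrounding names from the diff are in scope (the exact naming is just illustrative):
    ```scala
    // Bind the folded invocations to a local first instead of nesting the whole
    // expression inside the interpolated string.
    val argsString = arguments.map(_._2).mkString(", ")
    val functionCalls = subclassFunctions.map(name => s"$name($argsString)")
    val body = makeSplitFunction(foldFunctions(functionCalls))
    val code =
      s"""
         |private $returnType $func($argString) {
         |  $body
         |}
       """.stripMargin
    ```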


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83129/testReport)** for PR 19480 at commit [`4952880`](https://github.com/apache/spark/commit/4952880dcb9830616fde1c2c62f148347bbb0b55).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83052/testReport)** for PR 19480 at commit [`41c0b2b`](https://github.com/apache/spark/commit/41c0b2ba6a4e0afe9acd804d9d417f4882e519f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83079/testReport)** for PR 19480 at commit [`be84c4d`](https://github.com/apache/spark/commit/be84c4dc3fe40a939e9722f4ddb94e309aa7c841).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    LGTM pending Jenkins


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82682/testReport)** for PR 19480 at commit [`bdc2fdb`](https://github.com/apache/spark/commit/bdc2fdb18688b86a1b30d6ea8a692a3e95858049).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Could you resolve the conflicts, @mgaido91?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82734/testReport)** for PR 19480 at commit [`c4601b4`](https://github.com/apache/spark/commit/c4601b4ae1f6ba618e1a92e75b06faaf8d14bee3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144245263
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    --- End diff --
    
    Yes, but these are still splits, just bigger.
    The critical case for this is `compare`; in the other cases this function is `identity`. Let's consider the `compare` case: without using it here, you'd get a compilation error, because these "grouping methods" wouldn't return anything; if you apply it here instead, you get the default `return 0;` statement at the end.
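
    To make that concrete, a simplified sketch of what the ordering code passes to `splitExpressions` (the shape, not the exact source): the same `makeSplitFunction` that closes every split method with `return 0;` also has to close the per-inner-class grouping method, otherwise that method has no return statement.
    ```scala
    // Appends the default result so the generated Java method always returns.
    val makeSplitFunction: String => String = body =>
      s"""
         |$body
         |return 0;
       """.stripMargin

    // Folds the calls with early returns on the first non-zero comparison.
    val foldFunctions: Seq[String] => String = calls =>
      calls.zipWithIndex.map { case (call, i) =>
        s"""
           |int comp_$i = $call;
           |if (comp_$i != 0) {
           |  return comp_$i;
           |}
         """.stripMargin
      }.mkString("\n")
    ```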


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146865925
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    In this example, there is a 2 space indentation because there is an `extends` clause and, according to the link you provided, the `extends` requires a 2 space indentation, while the parameters require a 4 space one.
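
    So, as a sketch, with no `extends` clause the declaration under discussion would be indented like this:
    ```scala
    // 4-space indentation for the wrapped constructor parameters.
    private[codegen] case class NewFunction(
        functionName: String,
        subclassName: Option[String],
        subclassInstance: Option[String])
    ```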


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144244262
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Can't we call them through the instance?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144253074
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Sorry, I didn't completely get your last comment, could you elaborate a bit more? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82858/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144452720
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    What is the reason we chose this magic number in the original PR? cc @bdrillard @kiszk 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82735/testReport)** for PR 19480 at commit [`61cc445`](https://github.com/apache/spark/commit/61cc4458613722bba99fd92e9b30c1d797916846).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146775448
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +834,46 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to grow past the
    +      // JVM limit. To avoid this problem, we group them and call only the grouping methods
    +      // in the outer class.
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +            // Adding a new function to each subclass which contains
    +            // the invocation of all the ones which have been added to
    +            // that subclass
    +            val code = s"""
    +                |private $returnType $func($argString) {
    +                |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                      s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +                |}
    +              """.stripMargin
    +            addNewFunctionToClass(func, code, subclassName)
    +            Seq(s"$subclassInstance.$func")
    +          } else {
    +            subclassFunctions.map(f => s"$subclassInstance.$f")
    +          }
    +        }
    +
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    --- End diff --
    
    The same suggestion. Do not inline a complex expression.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242223
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +78,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the superclass, otherwise it contains the name of the
    --- End diff --
    
    You're right, it's the outer class, thanks, my bad.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83129/testReport)** for PR 19480 at commit [`4952880`](https://github.com/apache/spark/commit/4952880dcb9830616fde1c2c62f148347bbb0b55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147443947
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,82 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
    +
    +      val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
    +        innerClassFunctions,
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to grow past the
    +   * JVM limit. Moreover, it can also cause the outer class method where all the
    +   * invocations are performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesFunctionCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.LinkedHashMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    --- End diff --
    
    ```Scala
        functions.foreach { f =>
          val key = (f.innerClassName.get, f.innerClassInstance.get)
          val value = f.functionName +: innerClassToFunctions.getOrElse(key, Seq.empty[String])
          innerClassToFunctions.put(key, value)
        }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82758/testReport)** for PR 19480 at commit [`37506dc`](https://github.com/apache/spark/commit/37506dcc380cf5c14ea929b33f9e8e26efdbcb8d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @mgaido91 Thank you for working on this! Based on your explanation, sounds like a reasonable direction to me.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144291788
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    --- End diff --
    
    This might not be very intuitive for later readers. We'd better add comments to explain it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144555259
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not cause \"Code of method grows beyond 64 KB\"") {
    --- End diff --
    
    `splitExpressions should not generate codes beyond 64KB`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146796399
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    May I ask you to kindly explain what the right indentation is in this case? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144678394
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    Hmm, interesting. Your committed test can be seen as doing `dropDuplicates` on a 6000 column dataframe (`df.select(funcs: _*)`); the `funcs` are actually performed in the lower `Project`.
    
    ```scala
        val df2 = df.select(funcs: _*).cache()
        df2.collect()  // Pass
        val df3 = df2.dropDuplicates((1 to 5).map(colIndex => s"_$colIndex"))
        df3.collect()  // Exception
    ```
    So I think the simplified version has the same effect?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83127/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144302922
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    Besides the unit test, can you provide an end-to-end case that can trigger this issue too?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144480771
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    I was adding it to `DataFrameAggregateSuite`.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242646
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +291,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
    --- End diff --
    
    Can you try https://github.com/apache/spark/pull/19480#discussion_r144240879, so we may avoid changing this threshold?


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82697/testReport)** for PR 19480 at commit [`76b5489`](https://github.com/apache/spark/commit/76b5489d908f5a979280859021d61ed91cd106a7).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82697/
    Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83127/testReport)** for PR 19480 at commit [`36d8e2e`](https://github.com/apache/spark/commit/36d8e2e1596266f6f7552e2b21105d167e55efc1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144236599
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +78,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the superclass, otherwise it contains the name of the
    --- End diff --
    
    superclass or outer class?


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144251649
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Worse, I guess the functions added by `addNewFunction` are mostly called from the outer class, so basically this way is going to increase the Constant Pool entries of the outer class.
    
    I agree that with the suggested approach we increase the Constant Pool entries of the inner classes, but I think the inner classes are less exposed to this issue?
    
    This is just my guess, so please let me know if my reasoning is incorrect.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147234068
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to grow past the JVM limit.
    +   * Moreover, this can also cause the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesMethodsCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    +    // for performance reasons, the functions are prepended, instead of appended,
    +    // thus they are in reversed order
    --- End diff --
    
    Have you seen the perf issue when the number of functions is just in the thousands?
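    
    For context, a minimal sketch (not from the PR) of the asymptotic difference the comment above refers to: on an immutable `List`, prepending is O(1) per element while appending is O(n), so building N function names by appending is O(N^2) overall.
    
    ```scala
    val names = (1 to 5).map(i => s"apply_$i")
    // Prepending: O(1) per element, but the result ends up in reversed order.
    val prepended = names.foldLeft(List.empty[String])((acc, n) => n :: acc)
    // Appending: keeps the original order, but is O(n) per element on a List.
    val appended = names.foldLeft(List.empty[String])((acc, n) => acc :+ n)
    ```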


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by bdrillard <gi...@git.apache.org>.
Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @mgaido91 It is possible to increase the heap allocated during testing if that seems like a desirable thing to do (I did so for #18075, but the current default is already 4GB), see [1] and [2] depending on which SQL module the test is in.
    
    [1] https://github.com/apache/spark/blob/master/sql/core/pom.xml#L198
    [2] https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L137


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144554795
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    @mgaido91 Do you mean "2: compact primitive declarations into arrays" in SPARK-18016?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r145267463
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,16 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    --- End diff --
    
    Yes, this would be the maximum number currently, but #19518 will bring a great improvement on that front. Actually, I don't have an exact answer because it depends on many factors, like which transformations are performed, what the data types are, and so on. So I am not able to give a precise answer on that, sorry.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83084/
    Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83084/testReport)** for PR 19480 at commit [`a39098d`](https://github.com/apache/spark/commit/a39098dc0319cbd207e132aed7214f1c02245ec6).


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144244945
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    --- End diff --
    
    Oh. I see. Nvm.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144240879
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    So shall we call `addNewFunction` to add the caller method to an inner class and invoke those split functions from it?
    



---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146773769
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    Create a variable in `object CodeGenerator`?
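    
    A minimal sketch of what such a constant could look like (the name below is illustrative, not necessarily the one used in the final patch):
    
    ```scala
    object CodeGenerator {
      // Estimated size of generated code, in bytes, after which new functions are
      // added to a private, inner class instead of the outer class, so that the
      // outer class' constant pool stays well below the 65,536-entry limit.
      final val GENERATED_CLASS_SIZE_THRESHOLD: Int = 1000000
    }
    ```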


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146775221
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +834,46 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to grow past the JVM limit.
    +      // To avoid this problem, we group them and we call only the grouping methods in the
    +      // outer class.
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
    --- End diff --
    
    Normally, we do not like such a long chain of function calls. If possible please split it. Thanks! 
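    
    A self-contained sketch (with an illustrative case class, not the PR's actual types) of how the chain could be split into named intermediate steps inside a separate helper:
    
    ```scala
    case class FnSpec(
        functionName: String,
        subclassName: Option[String],
        subclassInstance: Option[String])
    
    def innerClassFunctionCalls(functions: Seq[FnSpec]): Seq[String] = {
      // Name each step instead of chaining filter/foldLeft/flatMap in one expression.
      val innerFunctions = functions.filter(_.subclassName.isDefined)
      val grouped = innerFunctions.foldLeft(Map.empty[(String, String), Seq[String]]) {
        case (acc, f) =>
          val key = (f.subclassName.get, f.subclassInstance.get)
          acc.updated(key, acc.getOrElse(key, Seq.empty) :+ f.functionName)
      }
      grouped.toSeq.flatMap { case ((_, instance), fns) => fns.map(f => s"$instance.$f") }
    }
    ```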


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82687/testReport)** for PR 19480 at commit [`831fc40`](https://github.com/apache/spark/commit/831fc40a941620671f001cb98fc9d05c299db897).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147452921
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,91 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
    +
    +      val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
    +        innerClassFunctions,
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to grow past the JVM limit.
    +   * Moreover, this can also cause the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesFunctionCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.LinkedHashMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      val value = f.functionName +: innerClassToFunctions.getOrElse(key, Seq.empty[String])
    +      innerClassToFunctions.update(key, value)
    --- End diff --
    
    Nit: `update` -> `put`
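    
    For reference (standard Scala `mutable.Map` API, not PR-specific), the two methods differ only in their return type, so this is purely a readability nit:
    
    ```scala
    import scala.collection.mutable
    
    val m = mutable.LinkedHashMap.empty[String, Seq[String]]
    m.update("NestedClass", Seq("apply_1"))              // returns Unit
    val previous = m.put("NestedClass", Seq("apply_2"))  // returns the previous value: Some(Seq("apply_1"))
    ```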


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144452981
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    Instead of copying your customer's code, can you make a fake one?


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242072
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +291,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
    --- End diff --
    
    yes, without this change I got the Constant Pool limit exception for the {{NestedClass}}. Actually, I tried to address this issue in this PR (https://github.com/apache/spark/pull/19447), but without the other changes in the current PR I wasn't able to create a UT for it.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144288530
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    --- End diff --
    
    Thanks, I will create a field to make this easy to change.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144480289
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    thank you very much for your help @viirya ! In my use cases the problem seemed to be connected to the `dropDuplicates` method and I focused on it, but thanks to your suggestion I now realize that `dropDuplicates` by itself is not enough: it also needs some functions applied to the columns to trigger the issue! Thank you so much. Where should I add this test case?
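    
    For reference, a minimal sketch of the kind of end-to-end reproduction discussed above (column count and expressions are illustrative, not the committed test; it assumes a `spark` session as in the SQL test suites):
    
    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.expr
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
    
    val colNumber = 5000
    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    val df = spark.createDataFrame(input, StructType(
      (1 to colNumber).map(i => StructField(s"_$i", IntegerType, nullable = false))))
    // Applying a function to every column makes splitExpressions generate many methods;
    // dropDuplicates on top of that adds the constant pool pressure being discussed.
    val funcs = (1 to colNumber).map(i => expr(s"if(1000 < _$i, 1000, _$i)"))
    df.select(funcs: _*).dropDuplicates((1 to 5).map(i => s"_$i")).collect()
    ```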


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147060840
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,22 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    --- End diff --
    
    I saw three ways of referring to the same concept: subclass, inner class, and nested class.
    
    How about renaming all of them to inner classes? Could you go over all the code changes in this PR to make them consistent?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147061319
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +832,49 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to grow past the JVM limit.
    +      // To avoid this problem, we group them and we call only the grouping methods in the
    +      // outer class.
    +      val innerClassToFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +      val innerClassFunctions = innerClassToFunctions.flatMap {
    +        case ((subclassName, subclassInstance), subclassFunctions) =>
    +          if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +            // Adding a new function to each subclass which contains
    +            // the invocation of all the ones which have been added to
    +            // that subclass
    +            val body = foldFunctions(subclassFunctions.map(name => s"$name($argsString)"))
    +            val code = s"""
    +                |private $returnType $func($argString) {
    +                |  ${makeSplitFunction(body)}
    +                |}
    +              """.stripMargin
    +            addNewFunctionToClass(func, code, subclassName)
    +            Seq(s"$subclassInstance.$func")
    +          } else {
    +            subclassFunctions.map(f => s"$subclassInstance.$f")
    +          }
    +        }
    --- End diff --
    
    Could you build a separate private function for generating innerClassFunctions? The function `splitExpressions` is now pretty large after this PR.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82858/testReport)** for PR 19480 at commit [`95b0ad8`](https://github.com/apache/spark/commit/95b0ad82a68828055c45dbd729c753fae2007a5c).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83079/
    Test FAILed.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144258113
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Oh, I meant that the additional inner classes/instances are added as Constant Pool entries of the outer class. The outer class is more likely to be filled with various variables/methods than the inner classes, so I think adding to the Constant Pool of the inner classes should be a less serious issue than adding to the outer class.



---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Thank you @HyukjinKwon 


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144684397
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    Yeah, actually during several tries I found that setting the value lower can somehow reduce the chance of hitting the constant pool limit exception in the nested classes.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147220632
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    --- End diff --
    
    ```Scala
          val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
    
          val argsString = arguments.map(_._2).mkString(", ")
          val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
    
          val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
            innerClassFunctions,
            func,
            arguments,
            returnType,
            makeSplitFunction,
            foldFunctions)
    
          foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
    ```


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144283907
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +78,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the superclass, otherwise it contains the name of the
    --- End diff --
    
    thanks!


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146883884
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    I see. This is just my understanding.  
    Let us wait for an explanation from @gatorsmile.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82697/testReport)** for PR 19480 at commit [`76b5489`](https://github.com/apache/spark/commit/76b5489d908f5a979280859021d61ed91cd106a7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144242883
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    no, this is done intentionally: the grouping method has to be added to the {{NestedClass}} where these methods are added, because they are not available in other nested classes...


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144284488
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    --- End diff --
    
    you're right, even though this is very unlikely. We might also think of introducing a threshold. Which approach do you think is right?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144688592
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    Btw, since this test doesn't test what we want to test, we should remove it.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144250132
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    Because `addNewFunction` is not used only by `splitExpressions`, this threshold change may increase the number of nested sub-classes and, consequently, the number of their instances. So I think it is also bad for the Constant Pool limit?


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144688579
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    The unit test added to `CodeGenerationSuite` looks sufficient for identifying this particular issue, namely the constant pool limit in the outer class being hit because of too many method calls.
    
    It is hard to contrive an end-to-end test purely for reproducing this particular issue; at least I failed to come up with one after several tries.
    
    So let's wait and see if anyone has the chance or the insight to create one.
    
    If not, I think the unit test in `CodeGenerationSuite` should be good enough.
     


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @mgaido91 thanks, this example makes sense.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82687/
    Test FAILed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144480542
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +831,46 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +      // To avoid this problem, we group them and we call only the grouping methods in the
    +      // outer class.
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +            // Adding a new function to each subclass which contains
    +            // the invocation of all the ones which have been added to
    +            // that subclass
    +            val code = s"""
    +                |private $returnType $func($argString) {
    +                |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                      s"$name(${arguments.map(_._2).mkString(", ")})")))}
    --- End diff --
    
    we are inside the string interpolator here, so `|` is not missing
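
    As a side note, here is a small self-contained Scala sketch (an editor's illustration, not the generated code from this PR) of why no extra `|` is needed on the continuation line: that line is still inside the `${...}` interpolation, so `stripMargin` never sees it as a literal line of the string.
    
    ```scala
    object StripMarginDemo {
      def main(args: Array[String]): Unit = {
        val calls = Seq("apply_0", "apply_1", "apply_2")
        // The expression inside ${...} spans two source lines, but it contributes a
        // single interpolated value; only the literal lines carry the '|' margin.
        val code = s"""
            |void run() {
            |  ${calls.map(name =>
                 s"$name(i);").mkString("\n  ")}
            |}
          """.stripMargin
        println(code)
      }
    }
    ```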


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147446840
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,82 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
    +
    +      val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
    +        innerClassFunctions,
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesFunctionCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.LinkedHashMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    +
    +    val argDefinitionString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ")
    +    val argInvocationString = arguments.map(_._2).mkString(", ")
    +
    +    innerClassToFunctions.flatMap {
    +      case ((innerClassName, innerClassInstance), innerClassFunctions) =>
    +        // for performance reasons, the functions are prepended, instead of appended,
    +        // thus here they are in reversed order
    +        val orderedFunctions = innerClassFunctions.reverse
    +        if (orderedFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +          // Adding a new function to each inner class which contains
    +          // the invocation of all the ones which have been added to
    +          // that inner class
    +          val body = foldFunctions(orderedFunctions.map(name =>
    +            s"$name($argInvocationString)"))
    +          val code = s"""
    +              |private $returnType $funcName($argDefinitionString) {
    +              |  ${makeSplitFunction(body)}
    +              |}
    +            """.stripMargin
    --- End diff --
    
    ```Scala
              // Adding a new function to each inner class which contains the invocation of all the
              // ones which have been added to that inner class. For example,
              //   private class NestedClass {
              //     private void apply_862(InternalRow i) { ... }
              //     private void apply_863(InternalRow i) { ... }
              //       ...
              //     private void apply(InternalRow i) {
              //       apply_862(i);
              //       apply_863(i);
              //       ...
              //     }
              //   }
              val body = foldFunctions(orderedFunctions.map(name => s"$name($argInvocationString)"))
              val code =
                s"""
                  |private $returnType $funcName($argDefinitionString) {
                  |  ${makeSplitFunction(body)}
                  |}
                 """.stripMargin
    ```


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144562615
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    @viirya exactly, I meant that. Thank you for your suggestion. You have been very helpful to me.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144683715
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    The following test, which can't pass on the current master branch, passes with your fix. I didn't see the OOM issue or the nested class constant pool issue.
    
    ```scala
      test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
        val colNumber = 5000
        val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
        val df = sqlContext.createDataFrame(input, StructType(
          (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
        val funcs = (1 to colNumber).map { colIndex =>
          val colName = s"_$colIndex"
          col(colName).cast(LongType)
        }
        df.select(funcs: _*).dropDuplicates((1 to 5).map(colIndex => s"_$colIndex")).collect()
      }
    ```


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144481305
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -1010,6 +1079,8 @@ object CodeGenerator extends Logging {
       // This is the value of HugeMethodLimit in the OpenJDK JVM settings
       val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000
     
    +  val MERGE_SPLIT_METHODS_THRESHOLD = 3
    --- End diff --
    
    @gatorsmile it comes from [this discussion](https://github.com/apache/spark/pull/19480/files/bdc2fdb18688b86a1b30d6ea8a692a3e95858049#r144282402).


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144289537
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    --- End diff --
    
    So the calls to the outer class functions should not be an issue, even though there could be many of them?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147234340
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesMethodsCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    +    // for performance reasons, the functions are prepended, instead of appended,
    +    // thus they are in reversed order
    +    innerClassToFunctions.transform { case (_, functions) => functions.reverse }
    +
    +    val argDefinitionString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ")
    +    val argInvocationString = arguments.map(_._2).mkString(", ")
    +
    +    innerClassToFunctions.flatMap {
    +      case ((innerClassName, innerClassInstance), innerClassFunctions) =>
    +        if (innerClassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +          // Adding a new function to each inner class which contains
    +          // the invocation of all the ones which have been added to
    +          // that inner class
    +          val body = foldFunctions(innerClassFunctions.map(name =>
    +            s"$name($argInvocationString)"))
    +          val code = s"""
    +              |private $returnType $funcName($argDefinitionString) {
    +              |  ${makeSplitFunction(body)}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(funcName, code, innerClassName)
    +          Seq(s"$innerClassInstance.$funcName")
    --- End diff --
    
    The same here.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82758/testReport)** for PR 19480 at commit [`37506dc`](https://github.com/apache/spark/commit/37506dcc380cf5c14ea929b33f9e8e26efdbcb8d).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Sounds good to me. Sorry for being late since I was busy last week.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Thanks! Merged to master.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144252483
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    +                    s"$name(${arguments.map(_._2).mkString(", ")})")))}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(func, code, subclassName)
    --- End diff --
    
    yes, for each new `NestedClassX` we have a new `nestedClassInstanceX` attribute in the outer class. Every attribute adds two entries to the constant pool. So far I have never been able to generate a class with more than 1-2 nested classes. Since we are roughly setting this value to half of the previous one, we can say that we are going to have 2-4 nested classes, which means we are adding at most 10 entries to the constant pool of the outer class: I think this is not a big deal.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Thank you all for your reviews. I addressed the last comments.
    @gatorsmile meanwhile, may I ask you to backport this to branch-2.2 once it is merged?
    Thanks.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146855751
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    I think that [2-space indentation](https://github.com/databricks/scala-style-guide#indent) is expected.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144310237
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    I have a use case where I faced this problem, and I tried this patch on it. Unfortunately it involves very complex business logic and I have not been able to reproduce the problem in a simpler one. But if needed, I can try again.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82861/
    Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144440892
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -1010,6 +1079,8 @@ object CodeGenerator extends Logging {
       // This is the value of HugeMethodLimit in the OpenJDK JVM settings
       val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000
     
    +  val MERGE_SPLIT_METHODS_THRESHOLD = 3
    --- End diff --
    
    Let's add a short comment for this too.
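
    One possible wording, as an editor's hedged sketch based on how the constant is used in this diff (not necessarily the comment that was eventually merged):
    
    ```scala
    // When an inner class accumulates more than this many split methods, they are
    // wrapped by a single method defined in that inner class, so the outer class
    // performs one call per inner class instead of one call per split method.
    val MERGE_SPLIT_METHODS_THRESHOLD = 3
    ```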


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82861/testReport)** for PR 19480 at commit [`bce3616`](https://github.com/apache/spark/commit/bce3616cccfb5c1b1dc7fca32d17764fda142e7d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r145264550
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,16 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    --- End diff --
    
    @mgaido91 . 
    I'm wondering if this is the new maximum number of columns tested in Spark?
    After your patch, what will be the minimum value of `colNumber` to cause failures in Spark?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144685307
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    After several tries, it seems hard to make an end-to-end test for the reported issue.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144243356
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    +        }
    +        .map { case ((subclassName, subclassInstance), subclassFunctions) =>
    +          // Adding a new function to each subclass which contains
    +          // the invocation of all the ones which have been added to
    +          // that subclass
    +          val code = s"""
    +              |private $returnType $func($argString) {
    +              |  ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
    --- End diff --
    
    This is going to wrap the calls to the split methods, not the expression codes. In the original `splitExpressions`, `makeSplitFunction` only works on the expression codes, not on the method calls. But here it is applied to the method calls.
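
    To make the distinction concrete, a minimal Scala sketch (an editor's illustration with made-up bodies, not the actual CodeGenerator code): `makeSplitFunction` wraps whatever body it receives, and in the grouping method that body is a fold of method calls rather than expression code.
    
    ```scala
    object SplitFunctionSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative stand-ins; the real wrappers are supplied by the callers of splitExpressions.
        val makeSplitFunction: String => String = body =>
          s"""
             |// preparation
             |$body
             |// cleanup
           """.stripMargin
        val foldFunctions: Seq[String] => String = calls => calls.map(_ + ";").mkString("\n")
    
        // In the original splitExpressions the wrapped body is expression code ...
        println(makeSplitFunction("boolean isNull_1 = i.isNullAt(0);"))
        // ... while in the grouping method it is a fold of calls to the split methods.
        println(makeSplitFunction(foldFunctions(Seq("apply_862(i)", "apply_863(i)"))))
      }
    }
    ```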


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83119/
    Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @kiszk Does this look good to you?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144684909
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    But actually I tested this committed test: once I lowered the threshold from 1600000 to 1000000, it passes on the current master branch too.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82735/
    Test FAILed.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144294718
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    --- End diff --
    
    I will, thanks for the suggestion


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146859947
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +292,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
         // sub-class.
         val (className, classInstance) = if (inlineToOuterClass) {
           outerClassName -> ""
    -    } else if (currClassSize > 1600000) {
    +    } else if (currClassSize > 1000000) {
    --- End diff --
    
    I can do that for this PR. But what do you think about introducing an internal configuration for this, as I proposed in the other PR? If that should not be done, I can close the other PR. I think the discussion here and there showed the reasons for its creation; if they are not enough, then I will close it.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144459033
  
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
    @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
         }
       }
     
    +  test("SPARK-22226: group splitted expressions into one method per nested class") {
    --- End diff --
    
    @mgaido91 I can reproduce the issue with the following test case. You can check it:
    
    ```scala
      test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
        withSQLConf(
          (SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
          val colNumber = 1000
          val baseDF = spark.range(10).toDF()
          val newCols = (1 to colNumber).map { colIndex =>
            expr(s"id + $colIndex").as(s"_$colIndex")
          }
          val input = baseDF.select(newCols: _*)
          val aggs2 = (1 to colNumber).flatMap { colIndex =>
            val colName = s"_$colIndex"
            Seq(expr(s"stddev($colName)"),
              expr(s"stddev_samp($colName)"),
              expr(s"stddev_pop($colName)"),
              expr(s"variance($colName)"),
              expr(s"var_samp($colName)"),
              expr(s"var_pop($colName)"),
              expr(s"skewness($colName)"),
              expr(s"kurtosis($colName)"))
          }
          input.agg(aggs2.head, aggs2.tail: _*).collect()
        }
      }
    ```
    
    
    ```
    [info]   Cause: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
    [info]   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1079)
    ```


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83119/testReport)** for PR 19480 at commit [`6e8cd00`](https://github.com/apache/spark/commit/6e8cd00e6eab9fb64faae65214547457a86bbfcb).


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82687/testReport)** for PR 19480 at commit [`831fc40`](https://github.com/apache/spark/commit/831fc40a941620671f001cb98fc9d05c299db897).


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    ok to test


---



[GitHub] spark issue #19480: [SPARK-22226] splitExpression can create too many method...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82682/
    Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144588780
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
           Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
       }
    +
    +  test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
    +    val colNumber = 10000
    +    val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
    +    val df = sqlContext.createDataFrame(input, StructType(
    +      (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
    +    val newCols = (1 to colNumber).flatMap { colIndex =>
    +      Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
    +        expr(s"sqrt(_$colIndex)"))
    +    }
    +    df.select(newCols: _*).collect()
    +  }
    +
    +  test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") {
    --- End diff --
    
    this still fails for the same reason your first suggestion fails


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146773177
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    The indent issues.


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144243121
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -277,13 +291,25 @@ class CodegenContext {
           funcName: String,
           funcCode: String,
           inlineToOuterClass: Boolean = false): String = {
    +    val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
    +    newFunction match {
    +      case NewFunction(functionName, None, None) => functionName
    +      case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
    +        subclassInstance + "." + functionName
    +    }
    +  }
    +
    +  private[this] def addNewFunctionInternal(
    +      funcName: String,
    +      funcCode: String,
    +      inlineToOuterClass: Boolean): NewFunction = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    +    // threshold of 1000k bytes to determine when a function should be inlined to a private, nested
    --- End diff --
    
    I can't change that for the reason I explained there.
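
As an editorial aside for readers following the diff above: a minimal, self-contained sketch of what that pattern match resolves to. The case class shape mirrors the function spec quoted in the diff; the sample names in the trailing comments are hypothetical.

```scala
// Sketch only: mirrors the pattern match quoted in the diff above.
case class NewFunction(
    functionName: String,
    subclassName: Option[String],
    subclassInstance: Option[String])

// Returns the string callers should emit to invoke the generated function:
// either a plain name (outer class) or one qualified by the inner class instance.
def callTarget(f: NewFunction): String = f match {
  case NewFunction(name, None, None) => name
  case NewFunction(name, Some(_), Some(instance)) => s"$instance.$name"
}

// callTarget(NewFunction("apply_1", None, None)) == "apply_1"
// callTarget(NewFunction("apply_1", Some("NestedClass_0"), Some("nestedInstance0")))
//   == "nestedInstance0.apply_1"
```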


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144291200
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    --- End diff --
    
    The order of the function calls is important and should be kept. Does the map guarantee that the updated map still preserves the insertion order?
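
A short, illustrative aside on the question above (standard library behavior, not PR code): the default immutable `Map` gives no iteration-order guarantee once it grows past a few entries, whereas `mutable.LinkedHashMap` is documented to iterate in insertion order.

```scala
import scala.collection.mutable

// Default immutable Map: beyond a handful of entries it is hash-based, so
// iterating the keys does not follow insertion order.
val unordered = (1 to 8).map(i => s"apply_$i" -> i).toMap
// unordered.keys may come back as apply_5, apply_1, apply_8, ...

// mutable.LinkedHashMap iterates in insertion order.
val ordered = mutable.LinkedHashMap.empty[String, Int]
(1 to 8).foreach(i => ordered(s"apply_$i") = i)
// ordered.keys iterates as apply_1, apply_2, ..., apply_8
```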


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #83052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83052/testReport)** for PR 19480 at commit [`41c0b2b`](https://github.com/apache/spark/commit/41c0b2ba6a4e0afe9acd804d9d417f4882e519f2).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82734/testReport)** for PR 19480 at commit [`c4601b4`](https://github.com/apache/spark/commit/c4601b4ae1f6ba618e1a92e75b06faaf8d14bee3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    @kiszk Thanks!
    
    @mgaido91 I have not reviewed it carefully. Will do another pass in the next few days. Thanks for your fix!


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83129/
    Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147233642
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesMethodsCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]]
    --- End diff --
    
    Why not use `LinkedHashMap`?
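
As an illustration of this suggestion (a sketch, not the code that was eventually merged): grouping the split functions by their inner class with a `mutable.LinkedHashMap` keeps the first-seen order of the classes and the call order within each class, without a prepend-and-reverse step. `FuncSpec` below is a hypothetical stand-in for the PR's per-function spec.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-function spec used in the PR.
case class FuncSpec(innerClassName: String, innerClassInstance: String, functionName: String)

def groupByInnerClass(
    functions: Seq[FuncSpec]): mutable.LinkedHashMap[(String, String), mutable.ArrayBuffer[String]] = {
  val grouped = mutable.LinkedHashMap.empty[(String, String), mutable.ArrayBuffer[String]]
  functions.foreach { f =>
    val key = (f.innerClassName, f.innerClassInstance)
    // getOrElseUpdate inserts the key on first sight, so iterating over
    // `grouped` follows the order in which the inner classes were encountered;
    // appending to the ArrayBuffer keeps the per-class call order.
    grouped.getOrElseUpdate(key, mutable.ArrayBuffer.empty[String]) += f.functionName
  }
  grouped
}
```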


---



[GitHub] spark pull request #19480: [SPARK-22226] splitExpression can create too many...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r144291974
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -798,10 +830,35 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    --- End diff --
    
    exactly: since they are defined there, they add no entries to the constant pool


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83052/
    Test PASSed.


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    **[Test build #82861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82861/testReport)** for PR 19480 at commit [`bce3616`](https://github.com/apache/spark/commit/bce3616cccfb5c1b1dc7fca32d17764fda142e7d).


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    cc @rednaxelafx Do you have bandwidth to review this?


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147062715
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +832,49 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.subclassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +
    +      // Here we handle all the methods which have been added to the nested subclasses and
    +      // not to the outer class.
    +      // Since they can be many, their direct invocation in the outer class adds many entries
    +      // to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +      // To avoid this problem, we group them and we call only the grouping methods in the
    +      // outer class.
    +      val innerClassToFunctions = functions
    +        .filter(_.subclassName.isDefined)
    +        .foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
    +          val key = (f.subclassName.get, f.subclassInstance.get)
    +          acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
    --- End diff --
    
    To improve readability, let's avoid using `foldLeft` here. You can use a mutable map instead.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147234320
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesMethodsCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    +    // for performance reasons, the functions are prepended, instead of appended,
    +    // thus they are in reversed order
    +    innerClassToFunctions.transform { case (_, functions) => functions.reverse }
    +
    +    val argDefinitionString = arguments.map { case (t, name) => s"$t $name" }.mkString(", ")
    +    val argInvocationString = arguments.map(_._2).mkString(", ")
    +
    +    innerClassToFunctions.flatMap {
    +      case ((innerClassName, innerClassInstance), innerClassFunctions) =>
    +        if (innerClassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
    +          // Adding a new function to each inner class which contains
    +          // the invocation of all the ones which have been added to
    +          // that inner class
    +          val body = foldFunctions(innerClassFunctions.map(name =>
    +            s"$name($argInvocationString)"))
    +          val code = s"""
    +              |private $returnType $funcName($argDefinitionString) {
    +              |  ${makeSplitFunction(body)}
    +              |}
    +            """.stripMargin
    +          addNewFunctionToClass(funcName, code, innerClassName)
    +          Seq(s"$innerClassInstance.$funcName")
    +        } else {
    +          innerClassFunctions.map(f => s"$innerClassInstance.$f")
    --- End diff --
    
    Here, we are returning the functions' invocations, so passing the arguments directly is more straightforward:
    ```
    innerClassFunctions.reverse.map(f => s"$innerClassInstance.$f($argInvocationString)")
    ```


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r146905964
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -78,6 +79,20 @@ case class SubExprEliminationState(isNull: String, value: String)
     case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])
     
     /**
    + * The main information about a new added function.
    + *
    + * @param functionName String representing the name of the function
    + * @param subclassName Optional value which is empty if the function is added to
    + *                     the outer class, otherwise it contains the name of the
    + *                     inner class in which the function has been added.
    + * @param subclassInstance Optional value which is empty if the function is added to
    + *                         the outer class, otherwise it contains the name of the
    + *                         instance of the inner class in the outer class.
    + */
    +private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
    +    subclassInstance: Option[String])
    --- End diff --
    
    See the link: https://github.com/databricks/scala-style-guide#spacing-and-indentation
    
    If it does not fit on one line, we do it like this:
    ```Scala
    private[codegen] case class NewFunction(
        functionName: String,
        subclassName: Option[String],
        subclassInstance: Option[String])
    ```


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    LGTM except for two comments.
    
    Thanks for your work!!! @mgaido91 


---



[GitHub] spark issue #19480: [SPARK-22226][SQL] splitExpression can create too many m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19480
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19480


---



[GitHub] spark pull request #19480: [SPARK-22226][SQL] splitExpression can create too...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19480#discussion_r147346995
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -801,10 +831,84 @@ class CodegenContext {
                |  ${makeSplitFunction(body)}
                |}
              """.stripMargin
    -        addNewFunction(name, code)
    +        addNewFunctionInternal(name, code, inlineToOuterClass = false)
           }
     
    -      foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
    +      // Here we store all the methods which have been added to the outer class.
    +      val outerClassFunctions = functions
    +        .filter(_.innerClassName.isEmpty)
    +        .map(_.functionName)
    +
    +      val innerClassFunctions = generateInnerClassesMethodsCalls(
    +        functions.filter(_.innerClassName.nonEmpty),
    +        func,
    +        arguments,
    +        returnType,
    +        makeSplitFunction,
    +        foldFunctions)
    +
    +      val argsString = arguments.map(_._2).mkString(", ")
    +      foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
    +        name => s"$name($argsString)"))
    +    }
    +  }
    +
    +  /**
    +   * Here we handle all the methods which have been added to the inner classes and
    +   * not to the outer class.
    +   * Since they can be many, their direct invocation in the outer class adds many entries
    +   * to the outer class' constant pool. This can cause the constant pool to past JVM limit.
    +   * Moreover, this can cause also the outer class method where all the invocations are
    +   * performed to grow beyond the 64k limit.
    +   * To avoid these problems, we group them and we call only the grouping methods in the
    +   * outer class.
    +   *
    +   * @param functions a [[Seq]] of [[NewFunctionSpec]] defined in the inner classes
    +   * @param funcName the split function name base.
    +   * @param arguments the list of (type, name) of the arguments of the split function.
    +   * @param returnType the return type of the split function.
    +   * @param makeSplitFunction makes split function body, e.g. add preparation or cleanup.
    +   * @param foldFunctions folds the split function calls.
    +   * @return an [[Iterable]] containing the methods' invocations
    +   */
    +  private def generateInnerClassesMethodsCalls(
    +      functions: Seq[NewFunctionSpec],
    +      funcName: String,
    +      arguments: Seq[(String, String)],
    +      returnType: String,
    +      makeSplitFunction: String => String,
    +      foldFunctions: Seq[String] => String): Iterable[String] = {
    +    val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]]
    +    functions.foreach(f => {
    +      val key = (f.innerClassName.get, f.innerClassInstance.get)
    +      innerClassToFunctions.update(key, f.functionName +:
    +        innerClassToFunctions.getOrElse(key, Seq.empty[String]))
    +    })
    +    // for performance reasons, the functions are prepended, instead of appended,
    +    // thus they are in reversed order
    --- End diff --
    
    with 2,000 functions, my tests on my machine (2.5 GHz Intel Core i7, OSX) are:
     - with this solution: 3 ms
     - with appending at the end: 121 ms
    And of course, as the number increases, the difference grows.
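
For readers outside the thread, a rough sketch of the cost difference being measured (the 3 ms vs 121 ms figures above come from the author's machine, not from this code): appending to an immutable `Seq` copies the existing elements on every update, so building the lists by appending is quadratic overall, while prepending and reversing once at the end stays linear.

```scala
// Illustrative micro-benchmark only; absolute timings will vary by machine.
def timeMs[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000)
}

val names = (1 to 2000).map(i => s"apply_$i")

// Append: each :+ on the underlying List copies the whole prefix, O(n^2) total.
val (appended, appendMs) = timeMs {
  names.foldLeft(Seq.empty[String])((acc, n) => acc :+ n)
}

// Prepend then reverse once: O(1) per element plus a single O(n) reverse.
val (prepended, prependMs) = timeMs {
  names.foldLeft(Seq.empty[String])((acc, n) => n +: acc).reverse
}

// appended == prepended, but prependMs is typically much smaller than appendMs.
```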


---
