Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2017/10/06 14:05:08 UTC

[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19447

    [SPARK-22215][SQL] Add configuration to set the threshold for generated class

    ## What changes were proposed in this pull request?
    
    SPARK-18016 introduced an arbitrary, hardcoded threshold for the size of a generated class (https://github.com/apache/spark/blob/83488cc3180ca18f829516f550766efb3095881e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L286).
    In some cases, making it smaller helps avoid exceeding the maximum number of entries in a class's Constant Pool.
    This PR introduces a new configuration parameter that defaults to the previous hardcoded value but allows setting a smaller one when needed.
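
    For example, a user hitting the Constant Pool limit could lower the threshold like this (the value below is illustrative only):

    ```scala
    // Illustrative only: lower the threshold so generated functions are moved
    // into nested classes earlier. It could also be passed at submit time via
    // --conf spark.sql.codegen.generatedClass.size.threshold=400000
    spark.conf.set("spark.sql.codegen.generatedClass.size.threshold", "400000")
    ```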
    
    ## How was this patch tested?
    manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22215

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19447
    
----
commit c69be31314d9aa96c3920073beaf7cca46d507fa
Author: Marco Gaido <mg...@hortonworks.com>
Date:   2017-10-06T12:58:36Z

    [SPARK-22215][SQL] Add configuration to set the threshold for generated class

----


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    @dongjoon-hyun I am not sure how I can test it. The use case in which this was useful is quite complex and I have not been able to reproduce it in a simpler way.


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143228854
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    Thanks for the explanation, I'll add it right away.


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Instead of simply introducing a new conf, we need to answer three questions; otherwise, this is not useful to the users.
    - What is the perf impact of this conf?
    - When should users increase/reduce the conf value?
    - Should we adjust the existing default values? 


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Thanks @kiszk, that would most probably work, but I am not sure which "number", i.e. which limit, to set there. This is why I thought a configuration might be better, being more flexible.
    
    For problem 3, no new configuration is needed. I am just adding a new method to each inner class which wraps all its small methods, so that they don't contribute to growing the outer class's constant pool. I know that a UT is necessary; unfortunately, so far I have not been able to create one. I will try with lower-level APIs and I hope I'll manage. Any suggestion is of course very welcome.
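
    Roughly, the shape is this (a sketch in Scala for brevity; the real generated code is Java and all names here are made up):

    ```scala
    // Sketch only: one wrapper per nested class means the outer class's constant
    // pool holds a single method reference instead of one per small function.
    class OuterProjection {
      private val nested = new NestedFunctions

      def apply(row: Array[Any]): Array[Any] = {
        nested.applyAll(row) // the only reference from the outer class
        row
      }

      private class NestedFunctions {
        def applyAll(row: Array[Any]): Unit = { smallFunc1(row); smallFunc2(row) }
        private def smallFunc1(row: Array[Any]): Unit = () // generated expression code
        private def smallFunc2(row: Array[Any]): Unit = () // generated expression code
      }
    }
    ```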
    
    Thanks.


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143227547
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    Users can change this value in the same way as other public parameters; `internal` just means that the parameter is not explicitly exposed to users.


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Could you add a test case for this? Since it's configurable, could you set a small value and catch the resulting exceptions?


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    We still need to add a test case. We should also capture the exception and issue a better one; otherwise, users will not know what to do when they hit such a confusing error.
    
    cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit this with the current hard-coded value? If so, what is an easy way to write a test case?



---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143213850
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,13 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
    --- End diff --
    
    The code that you pointed out is not in `CodeGenerator.scala`. In [my case](https://github.com/apache/spark/pull/18966/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR776) in `CodeGenerator.scala`, there are some cases where `SparkEnv.get` is `null`. If you have not seen this `null` in your case, that is good.
    @gatorsmile will implement a better approach to get this conf soon.


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143217286
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,13 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
    --- End diff --
    
    You should refer to other code in `CodeGenerator`:
    
    https://github.com/mgaido91/spark/blob/c69be31314d9aa96c3920073beaf7cca46d507fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L934


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Thanks for your comment @kiszk. It makes sense, actually, that a very small number like 10000 can cause exceptions. I will explain why later; first, a brief introduction to the problem, summarizing my knowledge of the topic.
    
    Code generation can create too many entries in the constant pool of the generated classes; this happens when we have thousands of columns. Many things add entries to the constant pool. Since the PR you referenced for SPARK-18016, we have several constant pools: one for the outer class and one for each of the inner classes. In particular, in Spark code generation, the constant pool gets entries for:
    
     1 - each attribute declared in a class: all the attributes are added to the outer class, so this is one limit. We should split them as well, but that is out of my focus now.
     2 - each method declared in a class: the number of methods is kept down by splitting them among the inner classes. This is controlled by the "number" which is the subject of this PR.
     3 - each method referenced in a class: at the moment, even though the methods are declared in the inner classes, they are all invoked from the outer class. This is another limit.
    
    Point 1 explains why a very low number like 10000 can create the problem: since the outer class has an attribute for each nested class, having a lot of them can reach the limit. With the current value or similar ones (in the use case I am working on I set it to 1000000), even with thousands of columns the number of inner classes stays very small, so this is not a problem. But since we have multiple fields for each column, the number of columns is still bounded by point 1.
    
    In the use case I am working on I hit problems 2 and 3, in the reverse order actually. I have a fix for 3 too, though I have not yet created a PR because I could not reproduce that problem in a simple way usable in a UT either, and I was not sure about the right approach for such a PR (if someone has suggestions, I'd really appreciate the advice). As I said, the use case is rather complex, and all my attempts to reproduce it in an easy way were unsuccessful, since I always hit issue 1 first.
    
    After this long introduction, I hope it is clear that this PR's goal is to address those cases in which a lot of very small functions is what drives us to the constant pool limit. So far I have been unsuccessful reproducing this issue in an easy way, I am sorry.
    
    What I could do in a test case is check the number of inner classes that are created (`CodegenContext.classes`): with a lower threshold there should be more of them than with a bigger one; a sketch follows below. If this is ok as a UT, I can do that.
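
    Roughly (an untested sketch: it assumes the suite mixes in `withSQLConf`, that the dummy bodies are enough padding, and that `classes` is readable from the test, which may need a test-only accessor):

    ```scala
    // Untested sketch: generate many tiny functions and compare how many nested
    // classes the context creates under a low vs. a high threshold.
    def innerClassCount(threshold: String): Int = {
      var count = 0
      withSQLConf(SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key -> threshold) {
        val ctx = new CodegenContext
        (1 to 2000).foreach { i =>
          ctx.addNewFunction(s"dummy$i", s"private void dummy$i() { /* padding */ }")
        }
        count = ctx.classes.size // hypothetical accessor for the nested classes
      }
      count
    }

    assert(innerClassCount("1000") > innerClassCount("1600000"))
    ```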
    
    I hope this was clear enough. Please don't hesitate to ask for clarifications if needed.
    
    Thanks.



---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    How about factoring the threshold check out into a separate function and then testing that directly, like #18810: https://github.com/apache/spark/pull/18810/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR363
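
    Roughly like this (a sketch; the names are mine, not from the linked PR):

    ```scala
    // Sketch: a tiny, side-effect-free predicate that a unit test can call directly.
    private[codegen] def shouldSpillToNestedClass(currClassSize: Int, threshold: Int): Boolean =
      currClassSize > threshold

    // In a suite:
    assert(shouldSpillToNestedClass(currClassSize = 2000000, threshold = 1600000))
    assert(!shouldSpillToNestedClass(currClassSize = 1000, threshold = 1600000))
    ```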


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    @mgaido91 When I reduced this value to 10000, [this test case](https://github.com/apache/spark/pull/16648/files#diff-c7f041fda7fd9aa1a5225f86bab4b1b0R69) caused a `JVM limit of 0xFFFF` exception. Unfortunately, my proposal to watch the number of `CodegenContext.mutableStates` did not work either. This is because generating nested classes does not reduce the number of instance variables (more than 64000 in this test case) in the top class.
    
    I would appreciate it if you could double-check this, to rule out a mistake on my part. If you find that this PR cannot help [this test case](https://github.com/apache/spark/pull/16648/files#diff-c7f041fda7fd9aa1a5225f86bab4b1b0R69), could you please clarify, with test cases, which cases this PR does resolve? Then we can discuss whether this PR would be useful for users or not.
    
    What do you think?


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    @gatorsmile I am happy to take it. We have [one test case](https://github.com/apache/spark/pull/16648/files#diff-c7f041fda7fd9aa1a5225f86bab4b1b0R69) that causes the error `JVM limit of 0xFFFF`. I am not sure whether this new option helps it or not. I will try it this long weekend (Monday is a holiday in Japan).
    
    Ideally, we want to find a better metric than the string length of the generated class.
    Since constant pool entries are used for [various purposes](https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.4), I imagine that the majority in this case are `CONSTANT_Fieldref` entries (i.e. instance variables). It may be helpful to watch the number of instance variables in `CodegenContext.mutableStates`. cc: @rednaxelafx
    
    @maropu it would be good to make `currClassSize` more precise.


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Do you mean Spark will work with a small value like 1000?


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143204826
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,13 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
    --- End diff --
    
    Can this be empty? In https://github.com/apache/spark/blob/83488cc3180ca18f829516f550766efb3095881e/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L80 there is no check for it being `null`...


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Yes, what I can do as a test case is check how many inner classes are generated while changing the configuration value. Is that ok?


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    I feel it is a bit annoying to add a parameter for each Constant Pool issue; we should look for solutions where fewer parameters (e.g., other metrics, as @kiszk suggested) can mostly solve the issue. I think splitting classes is not a silver bullet; if we can't compile the gen'd code, we should fall back to interpreted mode (this is filed in SPARK-21320).


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143527821
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,11 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SQLConf.get.generatedClassLengthThreshold
    --- End diff --
    
    Good to know [this discussion](https://github.com/apache/spark/pull/19449#pullrequestreview-67789784)


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143228937
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    +        "size is higher of this value, a private nested class is created and used." +
    +        "This is useful to limit the number of named constants in the class " +
    +        "and therefore its Constant Pool. The default is 1600k.")
    +      .intConf
    --- End diff --
    
    Please add `.checkValue` here with a valid minimum and maximum.
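
    e.g. something like this (a sketch; the bounds are illustrative, not agreed values):

    ```scala
    .intConf
    // Reject nonsensical values early: very large thresholds defeat the purpose
    // of splitting, since the Constant Pool cap of 0xFFFF entries still applies.
    .checkValue(v => v > 0 && v <= 1600000,
      "The generated class size threshold must be positive and at most 1600000.")
    .createWithDefault(1600000)
    ```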


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Yes, with small values it will produce a lot of small `NestedClass`es, but it will work. Instead, if the value is too high, all the functions (methods) that are created are inlined into the same class, and this can cause an exception from exceeding the maximum number of entries in the class's Constant Pool.
    



---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143220404
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    if users cannot set it, how can this parameter be changed?


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Here are the answers to your questions @gatorsmile, please tell me if I need to elaborate more deeply.
    This conf controls how many inner classes are generated. A big value means few inner classes, with the outer class and the inner classes each holding a lot of methods. A small one means a lot of classes with few methods each. So:
     1 - I don't think there is a big perf impact. Maybe, if this value is very small, we might have a lot of nested-class instances and therefore some wasted memory. But since these classes don't have any variables, this is quite negligible to me.
     2 - If the user encounters an exception like
    ```
    Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass has grown past JVM limit of 0xFFFF
    ```
    then they must reduce this value.
     3 - I don't think so, because I have not been able to reproduce the problem outside of a very complex use case. So that use case is probably a corner case which can benefit from this change.


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143223820
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    I think this is not a frequently-used parameter for users. Also, other JVM implementation-specific parameters are internal, e.g., `WHOLESTAGE_HUGE_METHOD_LIMIT`. cc: @rednaxelafx


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143224951
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    I agree with you, but could you please explain how to set a parameter which is internal? Thanks.


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Thank you for your clarification. I understand that you originally addressed the `JVM limit of 0xFFFF` caused by a lot of small methods.
    
    For 2, my idea is that it would be good to also monitor the number of method entries and split the class when it gets large (e.g. 16384). Does that work for your case?
    `if (currClassSize > 1600000 || (the number of method entries [= # of new function names] > 16384))`
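
    In code, roughly (a sketch; how we count the per-class method entries is up for discussion):

    ```scala
    // Sketch of the combined split condition; 1600000 is the current byte
    // threshold and 16384 is just an example bound on method entries.
    def shouldCreateNestedClass(currClassSize: Int, numMethodEntries: Int): Boolean =
      currClassSize > 1600000 || numMethodEntries > 16384
    ```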
    For 3, you already have a solution (hopefully without introducing a new config).
    
    Finally, we could control the number of method entries in problems 2 and 3 without a new config. What do you think?
    
    When we submit a PR, it is necessary to have a test case; we hope you can create one. If it is hard to write a program that triggers the issue, you can directly call internal APIs (e.g. in `CodeGeneratorSuite`).


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 closed the pull request at:

    https://github.com/apache/spark/pull/19447


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143202186
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,13 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
    --- End diff --
    
    Can you ensure `SparkEnv.get` is always non-null?


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    Thank you for your comments @kiszk and @dongjoon-hyun. I changed the approach a bit, following a similar pattern in the same file: https://github.com/mgaido91/spark/blob/c69be31314d9aa96c3920073beaf7cca46d507fa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1079
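
    i.e. roughly this shape (a sketch, not the exact code at that line):

    ```scala
    // Sketch of a defensive read: fall back to the default when SparkEnv is not
    // initialized (e.g. in some test environments), as pointed out above.
    val generatedClassLengthThreshold = Option(SparkEnv.get)
      .map(_.conf.getInt(
        SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
        SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get))
      .getOrElse(SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
    ```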
    
    Does this look good to you?



---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    About which "number", which "limit" to set there, we want to see what problem happens in test cases or your precise description (what class for code generation causes a lot of small methods while the number of instance variables are not very large under a certain condition). Then, we would discuss what is a better solution.
    
    For the PR for the problem 3, I think that it would be good to update this PR or create a new PR.


---



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19447
  
    I am closing this because, as @kiszk pointed out in his comment, there is no reliable way to get the `SQLConf` here.


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143219623
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -934,6 +934,15 @@ object SQLConf {
           .intConf
           .createWithDefault(10000)
     
    +  val GENERATED_CLASS_LENGTH_THRESHOLD =
    +    buildConf("spark.sql.codegen.generatedClass.size.threshold")
    +      .doc("Threshold in bytes for the size of a generated class. If the generated class " +
    --- End diff --
    
    Please add `.internal()`. I think this is not a parameter for users.
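
    i.e. something like this (a sketch based on the diff above, with the doc string shortened):

    ```scala
    val GENERATED_CLASS_LENGTH_THRESHOLD =
      buildConf("spark.sql.codegen.generatedClass.size.threshold")
        .internal()  // not listed in user-facing docs, but still settable
        .doc("Threshold in bytes for the size of a generated class. ...")
        .intConf
        .createWithDefault(1600000)
    ```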


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143215289
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,13 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SparkEnv.get.conf.getInt(
    +      SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.key,
    +      SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD.defaultValue.get)
    --- End diff --
    
    Could you simplify the above three lines into the following?
    ```scala
    val generatedClassLengthThreshold = SparkEnv.get.conf.get(
      SQLConf.GENERATED_CLASS_LENGTH_THRESHOLD)
    ```


---



[GitHub] spark pull request #19447: [SPARK-22215][SQL] Add configuration to set the t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19447#discussion_r143540779
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -279,11 +279,11 @@ class CodegenContext {
           inlineToOuterClass: Boolean = false): String = {
         // The number of named constants that can exist in the class is limited by the Constant Pool
         // limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
    -    // threshold of 1600k bytes to determine when a function should be inlined to a private, nested
    -    // sub-class.
    +    // threshold to determine when a function should be inlined to a private, nested sub-class
    +    val generatedClassLengthThreshold = SQLConf.get.generatedClassLengthThreshold
    --- End diff --
    
    Thanks for the interesting link. I assume using `SparkEnv` has the same problem, doesn't it?


---
