
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2016/06/01 14:54:14 UTC

[GitHub] spark pull request #13439: [SPARK-15701][SQL] Constant ColumnVector only nee...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/13439

    [SPARK-15701][SQL] Constant ColumnVector only needs to prepare one capacity

    ## What changes were proposed in this pull request?
    
    `ColumnVector` has a flag that marks whether it is a constant `ColumnVector`. However, a constant `ColumnVector` still allocates space for its full capacity. Because a constant `ColumnVector` holds only one distinct value, it only needs to allocate space for one element. This can reduce its memory usage and speed up data access.
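
As a rough illustration of the idea, here is a minimal sketch of an int vector that allocates a single slot when it is marked constant. The class `OneSlotIntVectorSketch` and all of its members are hypothetical names made up for this example; this is not Spark's `ColumnVector` and not the code in this pull request.

```java
// Hypothetical sketch; this is not Spark's ColumnVector implementation.
final class OneSlotIntVectorSketch {
    private final boolean isConstant; // fixed at construction, never changes
    private final int capacity;       // number of logical rows
    private final int[] data;         // backing storage

    OneSlotIntVectorSketch(int capacity, boolean isConstant) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.isConstant = isConstant;
        // A constant vector holds one distinct value, so one slot is enough.
        this.data = new int[isConstant ? 1 : capacity];
    }

    void putInt(int rowId, int value) {
        checkRow(rowId);
        data[isConstant ? 0 : rowId] = value;
    }

    int getInt(int rowId) {
        checkRow(rowId);
        // Every row of a constant vector reads the same slot.
        return data[isConstant ? 0 : rowId];
    }

    int allocatedSlots() {
        return data.length;
    }

    private void checkRow(int rowId) {
        if (rowId < 0 || rowId >= capacity) {
            throw new IndexOutOfBoundsException("rowId out of range: " + rowId);
        }
    }

    public static void main(String[] args) {
        OneSlotIntVectorSketch constant = new OneSlotIntVectorSketch(4096, true);
        constant.putInt(0, 42);
        OneSlotIntVectorSketch regular = new OneSlotIntVectorSketch(4096, false);
        regular.putInt(100, 7);

        System.out.println(constant.allocatedSlots()); // slots backing the constant vector
        System.out.println(regular.allocatedSlots());  // slots backing the regular vector
        System.out.println(constant.getInt(4095));     // any row reads the single stored value
        System.out.println(regular.getInt(100));
    }
}
```

The sketch keeps the logical capacity separate from the allocated length, so callers can still address any row while the constant case stores one value.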
    
    ## How was this patch tested?
    Existing tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 constant-column-vector

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13439.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13439
    
----
commit 721d53ace9354b70e8b9c6decca0fb3b13c12426
Author: Liang-Chi Hsieh <si...@tw.ibm.com>
Date:   2016-06-01T14:39:56Z

    Support constant column vector.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59811/consoleFull)** for PR 13439 at commit [`b2c14ee`](https://github.com/apache/spark/commit/b2c14ee528cb93e0f077dbdf8681ee9ef790182a).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59738/



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59813/consoleFull)** for PR 13439 at commit [`0ec965f`](https://github.com/apache/spark/commit/0ec965ff7f55d780bd4bcee4a21123ba552eafc7).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Not Constant                           91 /   93          0.4        2224.7       1.0X   
        On Heap, Constant                               46 /   46          0.9        1120.7       2.0X   
        Off Heap, Not Constant                        1127 / 1295          0.0       27503.4       0.1X   
        Off Heap, Constant                             969 / 1008          0.0       23662.3       0.1X   
    
    




[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @rxin Hmm, I just think that if we can improve it by only adding a conditional check, it might be worth doing.
    
    Regarding the performance concern, here are the benchmark results for on-heap and off-heap column vectors before this patch:
    
    On Heap:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap                                         39 /   47          1.1         946.8       1.0X
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap                                         41 /   46          1.0         995.5       1.0X
    
    Off Heap:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap                                        65 /   75          0.6        1598.2       1.0X
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap                                        63 /   74          0.7        1532.5       1.0X 
    
    It looks like performance is not noticeably or significantly hurt.
    
    But if you still have concerns about this, we can close this.
    




[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Besides, I wrote this test following the other tests in `ColumnarBatchBenchmark` that benchmark on-heap and off-heap column vector access. I thought that might be enough. If not, is there anything else that needs to be tested?



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    The latest benchmark was run individually for each type of column vector. As stated in `ColumnarBatchBenchmark`, it is hard to reason about the JIT. If we run these 4 cases together in one benchmark, the numbers look inaccurate and odd.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk Yea. I am going to remove `OffHeapConstantColumnVector` and `OnHeapConstantColumnVector`.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    I don't know who the best person to review this is... cc @rxin 



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @rxin Another implementation is to check whether the column vector is constant and apply the corresponding logic in element access. Doing it in code generation sounds interesting. Since the column vector is not code-generated now, do you mean moving it to code generation entirely?



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59738/consoleFull)** for PR 13439 at commit [`e0bdbed`](https://github.com/apache/spark/commit/e0bdbed7bca52a946d7d8344ee77f4ca15b45fe9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public final class OffHeapColumnVector extends OffHeapColumnVectorBase `
      * `public abstract class OffHeapColumnVectorBase extends ColumnVector `
      * `public final class OffHeapConstantColumnVector extends OffHeapColumnVectorBase `
      * `public final class OnHeapColumnVector extends OnHeapColumnVectorBase `
      * `public abstract class OnHeapColumnVectorBase extends ColumnVector `
      * `public final class OnHeapConstantColumnVector extends OnHeapColumnVectorBase `



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    OK, I got it. So the point is that the memory usage reduction does not justify this change. Let me close it now.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    I see. My question is this: suppose we create 2 column vectors, one constant and one not. Because we do not reuse the column vectors, their constant flags are fixed and never change. Since they are two different instances, will the problem you described still happen? That is, if `getInt` is called on the first vector (constant) and later on the second (not constant), will performance drop?
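
To make the scenario in this question concrete, here is a minimal sketch in which one loop (a single `getInt` call site) is fed first a constant vector and then a non-constant one. The names `SharedCallSiteSketch`, `FlaggedIntVector`, and `sum` are hypothetical and exist only for this illustration; the timings it prints are not a rigorous benchmark and do not by themselves answer whether the JIT treats the branch differently.

```java
// Hypothetical sketch of the scenario; not Spark code and not a rigorous benchmark.
final class SharedCallSiteSketch {

    // Stand-in for a column vector whose constant flag is fixed per instance.
    static final class FlaggedIntVector {
        private final boolean isConstant;
        private final int numRows;
        private final int[] data;

        FlaggedIntVector(int numRows, boolean isConstant) {
            this.numRows = numRows;
            this.isConstant = isConstant;
            this.data = new int[isConstant ? 1 : numRows];
        }

        void putInt(int rowId, int value) {
            data[isConstant ? 0 : rowId] = value;
        }

        int getInt(int rowId) {
            // The conditional check discussed in the thread.
            return isConstant ? data[0] : data[rowId];
        }

        int numRows() {
            return numRows;
        }
    }

    // One getInt call site that both kinds of vectors pass through.
    static long sum(FlaggedIntVector v) {
        long total = 0;
        for (int i = 0; i < v.numRows(); i++) {
            total += v.getInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        FlaggedIntVector constant = new FlaggedIntVector(n, true);
        constant.putInt(0, 3);
        FlaggedIntVector regular = new FlaggedIntVector(n, false);
        for (int i = 0; i < n; i++) {
            regular.putInt(i, i % 7);
        }

        // Alternate the two instances through the same call site and print rough timings.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            long a = sum(constant);
            long t1 = System.nanoTime();
            long b = sum(regular);
            long t2 = System.nanoTime();
            System.out.printf("round %d: constant=%d ns, regular=%d ns (sums %d, %d)%n",
                round, t1 - t0, t2 - t1, a, b);
        }
    }
}
```

A proper measurement would need warm-up and isolation of each case, which is the kind of difficulty the thread attributes to reasoning about the JIT.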



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #60140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60140/consoleFull)** for PR 13439 at commit [`07ef523`](https://github.com/apache/spark/commit/07ef523af03809837d1b73c3c8db56504f244fab).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59814/



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60141/



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Oh, you are right. IMHO, it is too complex to introduce new implementation classes only for a column vector that holds the same value in every row.
    If we do introduce new implementation classes, adding compression schemes, as implemented in ```CachedBatch```, may be a more generic solution.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59815/consoleFull)** for PR 13439 at commit [`3c445ac`](https://github.com/apache/spark/commit/3c445aceff179c5bc7d6d78c0ddc8d4c097d2326).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #60140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60140/consoleFull)** for PR 13439 at commit [`07ef523`](https://github.com/apache/spark/commit/07ef523af03809837d1b73c3c8db56504f244fab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk 
    
    1. The allocation mechanism is not changed in this patch. Only for the column vectors representing partition columns and null columns does this patch allocate just 1 row.
    
    2. I think the speed-up might come from simplified memory access, because we don't need to scan the memory space for all values. For constant column vectors, only one value is accessed. I don't know whether eliminating the index bounds check helps or not. I will try that in `OnHeapColumnVector` and benchmark again to see the difference.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk Yea. As @rxin said, the current change is too much if it only reduces memory usage. I will change this to a simpler implementation.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59814/consoleFull)** for PR 13439 at commit [`42e4dcd`](https://github.com/apache/spark/commit/42e4dcd8d34f09fbde60b4d3863e423cec8356bb).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Can you tell me what you are trying to address here? Are you seeing a problem with memory usage? I just don't know why this is important ... the number of partition columns is typically small.
    




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59810/consoleFull)** for PR 13439 at commit [`93d1d08`](https://github.com/apache/spark/commit/93d1d08c540181b8e427345ab3902eead68ba2a6).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #60141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60141/consoleFull)** for PR 13439 at commit [`2226efc`](https://github.com/apache/spark/commit/2226efca5172e67a09f0972ef5ba110f7abce800).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @rxin I've updated this to a simpler approach that doesn't introduce new classes. The main change is to check whether the current vector is constant and access the data accordingly. Please take a look. Thanks.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk I've run some benchmark code. The results:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Not Constant                           57 /   59          0.7        1395.4       1.0X
        On Heap, Constant                               44 /   44          0.9        1062.0       1.3X
        Off Heap, Not Constant                        1059 / 1092          0.0       25861.4       0.1X
        Off Heap, Constant                             998 / 1143          0.0       24373.4       0.1X




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Can we have a benchmark program to show performance improvement?



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk BTW, we can't simply do this by calling `ColumnarBatch.allocate` with `maxRows=1`, because we still need to take care of element access. In other words, from the outside, the vector must look like it has the same number of elements as the other columns, not just 1 row.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @viirya, thank you for preparing the benchmark results. Let me clarify this implementation with two questions.
    
    1.  To reduce the memory footprint, does this PR allocate an array for each column sized to the number of rows instead of `DEFAULT_BATCH_SIZE`?
    2.  Why does this PR improve performance? When I look at `getInt()` in `OnHeapColumnVector.java` and `OnHeapConstantColumnVector.java`, they are almost identical (`intData[rowId]` and `intData[0]`). Does the improvement come from eliminating the index bounds check thanks to the constant index? Is that correct?
    If so, we could do something similar by using `Platform.getInt()` in `OnHeapColumnVector.java`.
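
To make the comparison in question 2 concrete, here is a rough sketch of the two accessors side by side. The class names `IntColumnSketch`, `PlainIntColumnSketch`, and `ConstantIntColumnSketch` are hypothetical stand-ins, not the classes from the pull request. Whether the JIT actually drops the bounds check for the fixed index is the open question here; the sketch does not settle it.

```java
// Hypothetical stand-ins for the two classes compared above; not Spark's code.
abstract class IntColumnSketch {
    protected final int[] intData;

    protected IntColumnSketch(int slots) {
        this.intData = new int[slots];
    }

    abstract void putInt(int rowId, int value);

    abstract int getInt(int rowId);
}

final class PlainIntColumnSketch extends IntColumnSketch {
    PlainIntColumnSketch(int rows) {
        super(rows); // one slot per row
    }

    @Override
    void putInt(int rowId, int value) {
        intData[rowId] = value;
    }

    @Override
    int getInt(int rowId) {
        return intData[rowId]; // variable index into a full-size array
    }
}

final class ConstantIntColumnSketch extends IntColumnSketch {
    ConstantIntColumnSketch() {
        super(1); // a single slot holds the only distinct value
    }

    @Override
    void putInt(int rowId, int value) {
        intData[0] = value; // rowId is ignored; every row shares slot 0
    }

    @Override
    int getInt(int rowId) {
        return intData[0]; // fixed index, independent of rowId
    }
}

final class ColumnSketchDemo {
    public static void main(String[] args) {
        IntColumnSketch constant = new ConstantIntColumnSketch();
        constant.putInt(0, 5);
        System.out.println(constant.getInt(1000)); // reads slot 0 regardless of rowId
    }
}
```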



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    But how much performance can we gain in reality with this? The number of partition columns is usually not that large in practice.




[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Going back to my original question: what's the point of this complicated pull request? How much memory would you save in practice? The column batches are not for persistent memory storage yet, and they are supposed to be only for a small number of rows.
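
For a sense of scale on the memory question, the following back-of-the-envelope sketch computes the payload saved per batch. The batch size of 4096 rows and the count of three constant int columns are assumptions chosen for illustration, not figures from this thread, and array object headers are ignored. `BatchMemorySketch` and its methods are hypothetical names.

```java
// Rough arithmetic only; the inputs are illustrative assumptions.
final class BatchMemorySketch {
    /** Payload bytes for one int column holding `slots` values (array header ignored). */
    static long intColumnBytes(int slots) {
        return (long) slots * Integer.BYTES;
    }

    /** Bytes saved per batch when `constantColumns` int columns keep one slot each. */
    static long savedBytes(int batchRows, int constantColumns) {
        return constantColumns * (intColumnBytes(batchRows) - intColumnBytes(1));
    }

    public static void main(String[] args) {
        int batchRows = 4096;     // assumed batch size
        int constantColumns = 3;  // assumed number of constant (e.g. partition) columns
        System.out.println("full int column bytes: " + intColumnBytes(batchRows));
        System.out.println("saved bytes per batch: " + savedBytes(batchRows, constantColumns));
    }
}
```

Under these assumptions the saving is on the order of tens of kilobytes per batch, which is the kind of figure the question above is weighing against the added complexity.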
    




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59811/consoleFull)** for PR 13439 at commit [`b2c14ee`](https://github.com/apache/spark/commit/b2c14ee528cb93e0f077dbdf8681ee9ef790182a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    So would it be more practical to benchmark the case in which some constant and some non-constant column vectors are used together, and compare it with the original case in which no column has this extra branch (i.e., without this patch)?



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59810/consoleFull)** for PR 13439 at commit [`93d1d08`](https://github.com/apache/spark/commit/93d1d08c540181b8e427345ab3902eead68ba2a6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Wouldn't this hurt performance even more due to the extra branch?




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59815/



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @kiszk I've tried eliminating the index bounds check in `OnHeapColumnVector` by using `Platform.getInt()`. However, I do not see any performance improvement.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    @viirya this is still a pretty major change for unclear benefits. There might be other more important things that need more eyes on...




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    **[Test build #59815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59815/consoleFull)** for PR 13439 at commit [`3c445ac`](https://github.com/apache/spark/commit/3c445aceff179c5bc7d6d78c0ddc8d4c097d2326).



[GitHub] spark issue #13439: [SPARK-15701][SQL] Modify ColumnVector to reduce memory ...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    What I meant is that if in one process you have some invocation of the function that would hit the true branch, and some other invocation of the function that would hit the false branch, the performance is going to be worse. Google "branch prediction" for more information.
    
    Basically you can't measure the overhead of an extra branch in practice by running a benchmark in which the flag is either always false or always true.
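
A benchmark along the lines suggested here would push both constant and non-constant vectors through the same accessor in one process, so the flag check sees both outcomes. The sketch below is a naive harness under that assumption: `MixedBranchSketch`, `Vec`, and the helper methods are hypothetical names, the timing uses `System.nanoTime()` with no proper warm-up, and a real measurement would more likely use a harness such as JMH. A strictly alternating pattern is also easy for a branch predictor, so a shuffled mix may be more telling.

```java
import java.util.Arrays;

// Naive illustration of a mixed-flag benchmark; not a rigorous measurement.
final class MixedBranchSketch {
    /** Hypothetical vector whose accessor branches on a constant flag. */
    static final class Vec {
        final int[] data;
        final boolean isConstant;

        Vec(int rows, boolean isConstant, int fill) {
            this.isConstant = isConstant;
            this.data = new int[isConstant ? 1 : rows];
            Arrays.fill(data, fill);
        }

        int getInt(int rowId) {
            return isConstant ? data[0] : data[rowId];
        }
    }

    /** Builds `count` vectors; every `constantEvery`-th one is constant (0 means none). */
    static Vec[] build(int count, int rows, int constantEvery) {
        Vec[] vectors = new Vec[count];
        for (int i = 0; i < count; i++) {
            boolean constant = constantEvery > 0 && i % constantEvery == 0;
            vectors[i] = new Vec(rows, constant, 1);
        }
        return vectors;
    }

    /** Sums every row of every vector through the same getInt call site. */
    static long sumAll(Vec[] vectors, int rows) {
        long sum = 0;
        for (int r = 0; r < rows; r++) {
            for (Vec v : vectors) {
                sum += v.getInt(r);
            }
        }
        return sum;
    }

    /** Returns elapsed nanoseconds for `iterations` passes over the vectors. */
    static long timeNanos(Vec[] vectors, int rows, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += sumAll(vectors, rows);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // keeps the JIT from discarding the loop
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int rows = 4096;
        Vec[] allPlain = build(8, rows, 0); // flag is always false
        Vec[] mixed = build(8, rows, 2);    // flag alternates between true and false
        System.out.println("all plain ns: " + timeNanos(allPlain, rows, 1000));
        System.out.println("mixed ns:     " + timeNanos(mixed, rows, 1000));
    }
}
```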



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Benchmarked again with the new change:
    
    Environment:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
    
    OnHeap, Not Constant:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Not Constant                           42 /   49          1.0        1020.3       1.0X
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Not Constant                           41 /   46          1.0         989.0       1.0X
    
    OnHeap, Constant:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Constant                               28 /   33          1.5         674.2       1.0X   
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        On Heap, Constant                               27 /   33          1.5         658.4       1.0X   
    
    OffHeap, Not Constant:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap, Not Constant                          63 /   73          0.6        1547.3       1.0X   
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap, Not Constant                          68 /   74          0.6        1663.5       1.0X   
    
    OffHeap, Constant:
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap, Constant                              27 /   33          1.5         662.5       1.0X
    
        ColumnVector R/W:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        Off Heap, Constant                              27 /   33          1.5         657.1       1.0X
    




[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/13439
  
    Yeah, that is right. The question is whether this change is worth doing for the memory usage reduction and performance gain. If the change were relatively small (another implementation, not the current one), I think it might be worth it. The current change is complicated, as you said, and code generation seems like too much for this. So do you think this patch is not worth doing? Or should I update this to another implementation so you can take a look and see whether it is worth it?

