Posted to reviews@spark.apache.org by kiszk <gi...@git.apache.org> on 2017/05/19 09:36:11 UTC

[GitHub] spark pull request #18033: Add compression/decompression of column data to C...

GitHub user kiszk opened a pull request:

    https://github.com/apache/spark/pull/18033

    Add compression/decompression of column data to ColumnVector

    ## What changes were proposed in this pull request?
    
    This PR adds compression/decompression of column data to `ColumnVector`.
    While the current `CachedBatch` can compress column data using multiple compression schemes, `ColumnVector` cannot compress column data. Compression is mandatory for the table cache.
    
    As a first step, this PR enables `RunLengthEncoding` for boolean/byte/short/int/long and `BooleanBitSet` for boolean. A separate JIRA will add support for other compression schemes.
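
    To illustrate the idea behind a bit-set scheme for booleans, here is a minimal sketch that packs one boolean per bit into 64-bit words. The object and method names (`BooleanBitPackingSketch`, `pack`, `unpack`) are hypothetical and the layout is an assumption for illustration; it is not the code or the on-disk format of Spark's `BooleanBitSet`.

    ```scala
    // Hypothetical sketch, not Spark's BooleanBitSet implementation.
    object BooleanBitPackingSketch {
      // Pack booleans into 64-bit words: value i is stored in bit (i % 64) of word (i / 64).
      def pack(values: Array[Boolean]): Array[Long] = {
        val words = new Array[Long]((values.length + 63) / 64)
        var i = 0
        while (i < values.length) {
          if (values(i)) words(i >> 6) |= 1L << (i & 63)
          i += 1
        }
        words
      }

      // Unpack `count` booleans. The count is not stored in the words,
      // so the caller has to keep it alongside the packed data.
      def unpack(words: Array[Long], count: Int): Array[Boolean] =
        Array.tabulate(count)(i => (words(i >> 6) & (1L << (i & 63))) != 0L)
    }
    ```

    With this layout, `Array(true, false, true, true)` packs into a single word whose value is 13 (binary 1101), and unpacking that word with a count of 4 returns the original values.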
    
    At a high level, when `ColumnVector.compress()` is called, the data is compressed from an array of a primitive data type into a byte array in `ColumnVector`. When `ColumnVector.decompress()` is called, the data is decompressed from the byte array back into the array of the primitive data type in `ColumnVector`. Both operations use `ArrayBuffer` to access the data.
    
    
    This PR added and changed the following APIs:
    
    `ArrayBuffer`
    * This new class is similar to `java.nio.ByteBuffer`. `ArrayBuffer` can wrap an array of any primitive data type, such as `Array[Int]` or `Array[Long]`, and tracks the current position to be accessed.
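
    As a rough picture of such a wrapper, the sketch below keeps a cursor over an `Array[Int]` and advances it on every read or write, in the style of `java.nio.ByteBuffer`. The class `IntArrayCursor` and its methods are hypothetical and specialised to `Int` for brevity; the `ArrayBuffer` class in this PR may look quite different.

    ```scala
    // Hypothetical stand-in for illustration only, not the PR's ArrayBuffer.
    final class IntArrayCursor(private val array: Array[Int]) {
      private var pos = 0

      def position: Int = pos
      def remaining: Int = array.length - pos
      def hasRemaining: Boolean = pos < array.length

      // Read the value at the current position, then advance by one element.
      def get(): Int = {
        if (!hasRemaining) throw new java.nio.BufferUnderflowException
        val v = array(pos)
        pos += 1
        v
      }

      // Write a value at the current position, then advance by one element.
      def put(v: Int): this.type = {
        if (!hasRemaining) throw new java.nio.BufferOverflowException
        array(pos) = v
        pos += 1
        this
      }

      // Move the cursor back to the start so the contents can be re-read.
      def rewind(): this.type = { pos = 0; this }
    }
    ```

    A caller would typically `put` values in sequence, call `rewind()`, and then `get` them back in the same order.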
    
    `ColumnType.get(buffer: ArrayBuffer): jvmType, ColumnType.put(buffer: ArrayBuffer)`
    * These APIs get a primitive value from, or put a primitive value into, the current position of the given `ArrayBuffer`.
    
    `Encoder.gatherCompressibilityStats(in: ArrayBuffer)`
    * This API calculates the uncompressed and compressed sizes for a given compression method.
    
    `Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit`
    * This API compresses the data in `from` and stores the compressed data in `to`. `to` must wrap a byte array large enough for the compressed data.
    
    `Decoder.decompress(values: ArrayBuffer): Unit`
    * This API decompresses the data passed to the `Decoder` via its constructor and stores the uncompressed data in `values`. `values` must wrap an array large enough for the uncompressed data.
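
    The interplay of the three steps above (estimate the size, compress into a pre-sized buffer, decompress into a pre-sized array) can be sketched with a simple run-length scheme over `Array[Int]`. This is an illustration under assumptions: it uses `java.nio.ByteBuffer` in place of the PR's `ArrayBuffer`, the names `RunLengthSketch`, `compressedSize`, `compress` and `decompress` are hypothetical, and the (value, run length) layout is not necessarily the one used by Spark's `RunLengthEncoding`.

    ```scala
    import java.nio.ByteBuffer

    // Hypothetical sketch, not the Encoder/Decoder code from this PR.
    object RunLengthSketch {
      // Bytes needed for the encoded form: each run takes a 4-byte value plus a 4-byte length.
      def compressedSize(values: Array[Int]): Int = {
        var runs = 0
        var i = 0
        while (i < values.length) {
          if (i == 0 || values(i) != values(i - 1)) runs += 1
          i += 1
        }
        runs * 8
      }

      // Encode `from` as (value, runLength) pairs into `to`.
      // `to` must have at least compressedSize(from) bytes remaining.
      def compress(from: Array[Int], to: ByteBuffer): Unit = {
        var i = 0
        while (i < from.length) {
          val value = from(i)
          var run = 1
          while (i + run < from.length && from(i + run) == value) run += 1
          to.putInt(value).putInt(run)
          i += run
        }
      }

      // Decode (value, runLength) pairs from `from` into `values`.
      // `values` must be large enough to hold the uncompressed data.
      def decompress(from: ByteBuffer, values: Array[Int]): Unit = {
        var out = 0
        while (from.hasRemaining) {
          val value = from.getInt()
          val run = from.getInt()
          java.util.Arrays.fill(values, out, out + run, value)
          out += run
        }
      }
    }
    ```

    For the input `Array(5, 5, 5, 5, 2, 2, 2, 9)` there are three runs, so `compressedSize` returns 24 bytes against 32 bytes uncompressed. The caller allocates a 24-byte buffer, calls `compress`, flips the buffer, and then calls `decompress` into an 8-element array to get the original values back.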
    
    ## How was this patch tested?
    
    Added new test suites

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-20807

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18033.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18033
    
----
commit 6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a
Author: Kazuaki Ishizaki <is...@jp.ibm.com>
Date:   2017-05-19T09:33:38Z

    initial commit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    **[Test build #77091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77091/testReport)** for PR 18033 at commit [`6d5497e`](https://github.com/apache/spark/commit/6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a).




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    @hvanhovell would it be possible to review this or let us know the appropriate persons for this review?
    cc @sameeragarwal




[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77091/
    Test FAILed.




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    **[Test build #77092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77092/testReport)** for PR 18033 at commit [`193a71b`](https://github.com/apache/spark/commit/193a71bb30cd38c5ca3d3c234bf2f1e2b8210f11).




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    ping @hvanhovell @sameeragarwal




[GitHub] spark pull request #18033: [SPARK-20807][SQL] Add compression/decompression ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk closed the pull request at:

    https://github.com/apache/spark/pull/18033




[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    **[Test build #77091 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77091/testReport)** for PR 18033 at commit [`6d5497e`](https://github.com/apache/spark/commit/6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class ArrayBuffer(array: Array[_]) `
      * `class ColumnVectorCompressionBuilder[T <: AtomicType](dataType: T) `




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    ping @hvanhovell @sameeragarwal 




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    **[Test build #77092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77092/testReport)** for PR 18033 at commit [`193a71b`](https://github.com/apache/spark/commit/193a71bb30cd38c5ca3d3c234bf2f1e2b8210f11).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    ping @hvanhovell




[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    @hvanhovell Could you please take a look?  cc @sameeragarwal 





[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18033
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77092/
    Test PASSed.

