Posted to reviews@spark.apache.org by dongjinleekr <gi...@git.apache.org> on 2017/03/15 11:19:51 UTC

[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard

GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/17303

    [SPARK-19112][CORE] add codec for ZStandard

    ## What changes were proposed in this pull request?
    
    Hadoop[^1] and HBase[^2] added support for ZStandard compression in their recent releases. This update enables saving a file to HDFS with the ZStandard codec by implementing `ZStandardCompressionCodec`. It also adds a new configuration for the default compression level, `spark.io.compression.zstandard.level`.
    
    [^1]: https://issues.apache.org/jira/browse/HADOOP-13578
    [^2]: https://issues.apache.org/jira/browse/HBASE-16710
    
    ## How was this patch tested?
    
    3 additional unit tests in `CompressionCodecSuite.scala`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/SPARK-19112

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17303.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17303
    
----
commit 1927b91d0d8621e9e2dc2a88a93e07780cfc66bf
Author: Lee Dongjin <do...@apache.org>
Date:   2017-03-15T08:09:56Z

    Implement ZStandardCompressionCodec

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by Cyan4973 <gi...@git.apache.org>.

Github user Cyan4973 commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    @maropu: What about compression ratios?



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    @Cyan4973 I quickly checked again:
    ```
    scaleFactor: 4
    AWS instance: c4.4xlarge	
    
    // In this bench, I used `local-cluster` (`local` used in the benchmark above)
    ./bin/spark-shell --master local-cluster[4,4,7500] \
      --conf spark.driver.memory=1g \
      --conf spark.executor.memory=7g \
      --conf spark.io.compression.codec=xxx
    
    --- zstd (level=3)
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 36.517211838s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 25.026869575s                                                   
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 24.370711575s                                                   
    
    --- zstd (level=1)
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 29.654705815s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 20.638918335s
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 19.928730758999997s
    
    --- lz4
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 27.422360631s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 17.38519278s
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 15.779084563s
    
    --- snappy
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 27.476569521000002s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 16.438640631s                                                   
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 14.949329456s
    
    --- lzf
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 27.853010073s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 17.431232532000003s
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 15.916569896999999s
    ```
    `zstd` was still slower than the others.
    Not sure, though; there might be a case where `zstd` beats the others on a larger data set.
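
To make the comparison above easier to read, the timings can be reduced to one number per codec. The sketch below is not part of the original benchmark: it hard-codes the execution times from the output above and, as an assumption, drops the first iteration of each run because it is consistently the slowest and probably includes warm-up cost. The names `CodecTimings`, `warmMean`, and `slowdown` are made up for illustration.

```scala
object CodecTimings {
  // Execution times in seconds, copied from the benchmark output above
  // (iterations 1 to 3 of q4-v1.4 for each codec).
  val runs: Map[String, Seq[Double]] = Map(
    "zstd (level=3)" -> Seq(36.517211838, 25.026869575, 24.370711575),
    "zstd (level=1)" -> Seq(29.654705815, 20.638918335, 19.928730758999997),
    "lz4"            -> Seq(27.422360631, 17.38519278, 15.779084563),
    "snappy"         -> Seq(27.476569521000002, 16.438640631, 14.949329456),
    "lzf"            -> Seq(27.853010073, 17.431232532000003, 15.916569896999999)
  )

  // Mean of every iteration after the first (assumed to be a warm-up run).
  def warmMean(times: Seq[Double]): Double = {
    val warm = times.drop(1)
    warm.sum / warm.size
  }

  val warmMeans: Map[String, Double] =
    runs.map { case (codec, times) => codec -> warmMean(times) }

  // Codec with the lowest warm mean.
  val fastest: String = warmMeans.minBy(_._2)._1

  // How many times slower a codec is than the fastest one.
  def slowdown(codec: String): Double = warmMeans(codec) / warmMeans(fastest)

  def main(args: Array[String]): Unit =
    warmMeans.toSeq.sortBy(_._2).foreach { case (codec, mean) =>
      println(f"$codec%-16s warm mean $mean%.2fs, ${slowdown(codec)}%.2fx of $fastest")
    }
}
```

With only two warm iterations per codec, these means are a rough indication rather than a statistically sound result.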



[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by Cyan4973 <gi...@git.apache.org>.

Github user Cyan4973 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17303#discussion_r115351567
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -215,3 +217,22 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "3").toInt
    --- End diff --
    
    Use cases that favor speed over size should prefer level 1.
    The difference in compression speed is fairly large.
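
As a small illustration of the point about levels, the following sketch reads the level as a plain integer and falls back to a default when the key is unset. It is not the code from the pull request: to stay self-contained it uses a `Map[String, String]` as a hypothetical stand-in for `SparkConf`, and the names `ZstdLevel`, `DefaultLevel`, and `FastLevel` are invented here. In Spark itself, `SparkConf.getInt(key, default)` would be the more direct way to read an integer setting than the `getSizeAsBytes(...).toInt` call in the diff.

```scala
import scala.util.Try

object ZstdLevel {
  // Hypothetical stand-in for SparkConf: a plain key/value map.
  type Conf = Map[String, String]

  // Configuration key taken from the diff above.
  val Key = "spark.io.compression.zstandard.level"

  // Default used in the diff above.
  val DefaultLevel = 3

  // Level suggested in the review for speed-sensitive use cases.
  val FastLevel = 1

  // Returns the configured level, or DefaultLevel when the key is absent.
  // A non-integer value is rejected rather than silently replaced.
  def level(conf: Conf): Int =
    conf.get(Key).map { raw =>
      Try(raw.trim.toInt).getOrElse(
        throw new IllegalArgumentException(s"$Key must be an integer, got '$raw'"))
    }.getOrElse(DefaultLevel)

  def main(args: Array[String]): Unit = {
    println(level(Map.empty))                     // key unset: default level
    println(level(Map(Key -> FastLevel.toString))) // explicit fast level
  }
}
```

Whether the default should be 1 or 3 is the open question in this review; the sketch only shows that the choice can be left to configuration.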



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    Same questions as on the last PR: can this be something the user includes if needed, or is there value in integrating it into Spark? Where would it come into play, and with which versions of Hadoop et al.?



[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17303



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    OK, seems like we should close this.



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    Yes, it'd be nice to have some benchmarks on this.



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by maropu <gi...@git.apache.org>.

Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    I ran quick benchmarks using a TPC-DS query (Q4), following the previous work in #10342.
    Based on the results, it seems a bit early to implement this:
    ```
    scaleFactor: 4
    AWS instance: c4.4xlarge	
    
    -- zstd
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 53.315878375s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 53.468174668s
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 57.282403146s 
    
    -- lz4
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 20.779643053s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 16.520911319s
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 15.897124967s
    
    -- snappy
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 21.132412036999998s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 15.908867743999998s                                             
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 15.789648712s
    
    -- lzf
    Running execution q4-v1.4 iteration: 1, StandardRun=true
    Execution time: 21.339518781s
    Running execution q4-v1.4 iteration: 2, StandardRun=true
    Execution time: 16.881225328s                                                   
    Running execution q4-v1.4 iteration: 3, StandardRun=true
    Execution time: 15.813455479s
    ```
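
Output in this shape is easy to post-process. The sketch below is an illustration only, not a tool used in the thread: it assumes that each codec section starts with a `-- name` (or `--- name`) header line and that each timing line has the form `Execution time: <seconds>s`, as in the log above. The object name `BenchLogParser` is hypothetical.

```scala
object BenchLogParser {
  // Section header such as "-- zstd" or "--- zstd (level=3)".
  private val Header = """-{2,3}\s+(.+)""".r
  // Timing line such as "Execution time: 53.315878375s".
  private val Time = """Execution time:\s+([0-9.]+)s""".r

  // Groups execution times (in seconds) by the codec header they follow.
  def parse(log: String): Map[String, Vector[Double]] = {
    val start = ("", Map.empty[String, Vector[Double]])
    val (_, byCodec) = log.split("\n").map(_.trim).foldLeft(start) {
      case ((_, acc), Header(name)) =>
        (name, acc)
      case ((codec, acc), Time(secs)) if codec.nonEmpty =>
        (codec, acc.updated(codec, acc.getOrElse(codec, Vector.empty) :+ secs.toDouble))
      case (state, _) =>
        state // "Running execution ..." and blank lines are ignored
    }
    byCodec
  }

  def main(args: Array[String]): Unit = {
    val sample =
      """-- zstd
        |Running execution q4-v1.4 iteration: 1, StandardRun=true
        |Execution time: 53.315878375s
        |-- lz4
        |Running execution q4-v1.4 iteration: 1, StandardRun=true
        |Execution time: 20.779643053s""".stripMargin
    parse(sample).foreach { case (codec, times) => println(s"$codec: $times") }
  }
}
```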



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    Can one of the admins verify this patch?



[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17303
  
    This should not be needed just to write to HDFS. The regular Hadoop input/output formats support it if you are using the right version (I think Hadoop 2.8).
    
    This seems to add the support to `spark.io.compression.codec` for internal compression. From what I've heard, zstd is better than the other codecs since it gives gzip-level compression with LZ4-level CPU usage. So if you have a job that has a ton of intermediate data or is causing network issues, you may want to use zstd to get gzip compression levels without much CPU penalty.
    
    @dongjinleekr It doesn't look like you ran any manual tests on a real cluster? It would be nice to have some basic performance/compression numbers to show it actually working. Are you planning on actually using zstd in your Spark deployment?
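
For the compression numbers requested here, a ratio measurement could look roughly like the sketch below. It is only a sketch under stated assumptions: to stay dependency-free it uses the JDK's `java.util.zip.Deflater` (the algorithm behind gzip) as a stand-in, not zstd, so the figures it produces say nothing about zstd itself. The object name `RatioProbe` and the sample data are invented; a real comparison would feed the same bytes through each Spark codec.

```scala
import java.util.zip.Deflater

object RatioProbe {
  // Compresses `data` with DEFLATE at the given level and returns the output size in bytes.
  def deflatedSize(data: Array[Byte], level: Int): Int = {
    val deflater = new Deflater(level)
    deflater.setInput(data)
    deflater.finish()
    val buf = new Array[Byte](8192)
    var total = 0
    while (!deflater.finished()) total += deflater.deflate(buf)
    deflater.end() // release native resources
    total
  }

  // Compression ratio: original size divided by compressed size.
  def ratio(data: Array[Byte], level: Int): Double =
    data.length.toDouble / deflatedSize(data, level)

  def main(args: Array[String]): Unit = {
    // Made-up, highly repetitive sample standing in for shuffle data.
    val sample = ("key=42,value=spark;" * 5000).getBytes("UTF-8")
    for (level <- Seq(Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION)) {
      val started = System.nanoTime()
      val r = ratio(sample, level)
      val millis = (System.nanoTime() - started) / 1e6
      println(f"level $level: ratio $r%.1f in $millis%.2f ms")
    }
  }
}
```

Reporting ratio and elapsed time side by side is what makes the trade-off between codecs and levels visible.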

