Posted to reviews@spark.apache.org by dongjinleekr <gi...@git.apache.org> on 2017/03/15 11:19:51 UTC
[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard
GitHub user dongjinleekr opened a pull request:
https://github.com/apache/spark/pull/17303
[SPARK-19112][CORE] add codec for ZStandard
## What changes were proposed in this pull request?
Hadoop[^1] and HBase[^2] added support for ZStandard compression in their recent releases. This update enables saving a file in HDFS using the ZStandard codec, by implementing ZStandardCodec. It also requires adding a new configuration for the default compression level, for example `spark.io.compression.zstandard.level`.
[^1]: https://issues.apache.org/jira/browse/HADOOP-13578
[^2]: https://issues.apache.org/jira/browse/HBASE-16710
## How was this patch tested?
3 additional unit tests in `CompressionCodecSuite.scala`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dongjinleekr/spark feature/SPARK-19112
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17303.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17303
----
commit 1927b91d0d8621e9e2dc2a88a93e07780cfc66bf
Author: Lee Dongjin <do...@apache.org>
Date: 2017-03-15T08:09:56Z
Implement ZStandardCompressionCodec
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by Cyan4973 <gi...@git.apache.org>.
Github user Cyan4973 commented on the issue:
https://github.com/apache/spark/pull/17303
@maropu: What about compression ratios?
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/17303
@Cyan4973 I quickly checked again:
```
scaleFactor: 4
AWS instance: c4.4xlarge
// In this bench, I used `local-cluster` (`local` used in the benchmark above)
./bin/spark-shell --master local-cluster[4,4,7500] \
--conf spark.driver.memory=1g \
--conf spark.executor.memory=7g \
--conf spark.io.compression.codec=xxx
--- zstd (level=3)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 36.517211838s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 25.026869575s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 24.370711575s
--- zstd (level=1)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 29.654705815s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 20.638918335s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 19.928730758999997s
--- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.422360631s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.38519278s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.779084563s
--- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.476569521000002s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.438640631s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 14.949329456s
--- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.853010073s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.431232532000003s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.916569896999999s
```
`zstd` was still slower than the others.
I'm not sure, though; there might be cases where `zstd` outperforms the others on larger data sets.
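To put a number on the gap, the timings in the log above can be averaged per codec. The following is only a sketch: it assumes that iteration 1 is dominated by warm-up (it is visibly slower for every codec) and therefore averages iterations 2 and 3 only. The object name `CodecTimings` is hypothetical; the values are copied from the log.

```scala
// Warm iterations (2 and 3) copied from the benchmark log above, in seconds.
// Dropping iteration 1 as warm-up is an assumption, not something the log states.
object CodecTimings {
  val warmRuns: Seq[(String, Seq[Double])] = Seq(
    "zstd (level=3)" -> Seq(25.026869575, 24.370711575),
    "zstd (level=1)" -> Seq(20.638918335, 19.928730758999997),
    "lz4"            -> Seq(17.38519278, 15.779084563),
    "snappy"         -> Seq(16.438640631, 14.949329456),
    "lzf"            -> Seq(17.431232532000003, 15.916569896999999)
  )

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Mean warm-run time per codec.
  val means: Map[String, Double] =
    warmRuns.map { case (name, xs) => name -> mean(xs) }.toMap

  def main(args: Array[String]): Unit = {
    val baseline = means("lz4")
    // Print each codec's mean and its ratio to the lz4 mean.
    warmRuns.foreach { case (name, _) =>
      val m = means(name)
      println(f"$name%-16s mean=$m%.2fs  vs lz4=${m / baseline}%.2fx")
    }
  }
}
```

Under that assumption, snappy has the lowest warm mean in this run, and zstd at levels 1 and 3 takes roughly 1.2x and 1.5x the lz4 mean, respectively. Two warm iterations per codec is a very small sample, so these ratios are indicative at best.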
[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by Cyan4973 <gi...@git.apache.org>.
Github user Cyan4973 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17303#discussion_r115351567
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -215,3 +217,22 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
}
}
}
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+ override def compressedOutputStream(s: OutputStream): OutputStream = {
+ val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "3").toInt
--- End diff --
Use cases that favor speed over size should prefer level 1.
The difference in compression speed is fairly large.
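For reference, the line under review reads the level with `getSizeAsBytes`, which parses its value as a byte size (so a value such as `1k` would also be accepted). A minimal sketch of reading it as a plain integer instead is shown below, assuming `spark-core` is on the classpath (for example, pasted into `spark-shell`). The helper name `zstdLevel` is hypothetical and not part of the PR; the key and the default of 3 are taken from the diff.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper (not in the PR): read the compression level as an Int.
// Key and default value are the ones shown in the diff above.
def zstdLevel(conf: SparkConf): Int =
  conf.getInt("spark.io.compression.zstandard.level", 3)

// A speed-oriented job could lower the level explicitly.
// SparkConf(false) skips loading defaults from system properties.
val speedFirst = new SparkConf(false)
  .set("spark.io.compression.zstandard.level", "1")

println(zstdLevel(speedFirst))           // level set explicitly above
println(zstdLevel(new SparkConf(false))) // falls back to the default
```

Whether the default itself should be 1 or 3 is the open question in this comment; the sketch only shows how a user could override it.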
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17303
Same questions as on the last PR: can this be something the user includes if needed, or is there value in integrating it into Spark? Where would it come into play, and with which versions of Hadoop et al.?
[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/17303
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17303
OK, seems like we should close this.
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/17303
Yes it'd be nice to have some benchmark on this.
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:
https://github.com/apache/spark/pull/17303
I ran quick benchmarks using a TPCDS query (Q4), following the previous work in #10342.
Based on the results, it seems a bit early to implement this:
```
scaleFactor: 4
AWS instance: c4.4xlarge
-- zstd
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 53.315878375s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 53.468174668s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 57.282403146s
-- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 20.779643053s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.520911319s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.897124967s
-- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.132412036999998s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 15.908867743999998s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.789648712s
-- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.339518781s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.881225328s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.813455479s
```
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17303
Can one of the admins verify this patch?
[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard
Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17303
This should not be needed just to write to HDFS. The regular Hadoop input/output formats support it if you are using the right version (I think Hadoop 2.8).
This seems to be adding support to `spark.io.compression.codec` for internal compression. From what I've heard, zstd is better than the other codecs since it gives gzip-level compression with LZ4-level CPU usage. So if you have a job that has a ton of intermediate data or is causing network issues, you may want to use zstd to get gzip compression levels without much CPU penalty.
@dongjinleekr It doesn't look like you ran any manual tests on a real cluster? It would be nice to have some basic performance/compression numbers to show it actually working. Are you planning on actually using zstd in your Spark deployment?
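As an illustration of the internal-compression use described above, a job could select the codec through `spark.io.compression.codec`, which accepts a fully qualified class name. This is a sketch only: it assumes a Spark build with this PR's patch applied (the PR was closed, so the class is not available otherwise), and the class name and level key are taken from the diff in this thread.

```scala
import org.apache.spark.SparkConf

// Assumption: the Spark build includes ZStandardCompressionCodec from this PR.
// Without the patch, Spark cannot load this class and codec creation fails.
val conf = new SparkConf()
  .setAppName("zstd-internal-compression-sketch") // hypothetical app name
  .set("spark.io.compression.codec",
       "org.apache.spark.io.ZStandardCompressionCodec")
  .set("spark.io.compression.zstandard.level", "1")
```

This setting affects Spark's internal data such as shuffle outputs and broadcast variables; it is separate from the Hadoop codec used when writing files to HDFS.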