Posted to reviews@spark.apache.org by sitalkedia <gi...@git.apache.org> on 2017/08/02 00:46:06 UTC

[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/18805

    [SPARK-19112][CORE] Support for ZStandard codec

    ## What changes were proposed in this pull request?
    
    Using zstd compression for Spark jobs spilling hundreds of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to a significant latency gain because of reduced disk I/O. CPU time degrades by 2-5% because of the zstd compression overhead, but for jobs that are bottlenecked by disk I/O this hit is acceptable.
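
    A minimal sketch of how this would be enabled, using the config keys introduced by this patch (the `zstd` short codec name is an assumption of the sketch, not something final until the patch lands):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Switch Spark's internal I/O compression (shuffle spill, broadcast, etc.) to zstd.
    val conf = new SparkConf()
      .set("spark.io.compression.codec", "zstd")           // short name assumed by this sketch
      .set("spark.io.compression.zstd.level", "1")         // level 1: fastest, still a good ratio
      .set("spark.io.compression.zstd.bufferSize", "32k")  // buffer wrapped around the JNI streams
    ```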
    
    ## How was this patch tested?
    
    Tested by running a few jobs that spill a large amount of data on the cluster; the amount of intermediate data written to disk was reduced by as much as 50%.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark skedia/upstream_zstd

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18805
    
----
commit cff558b6873a8ec184159a9df3c1e83c9cd0a6e7
Author: Sital Kedia <sk...@fb.com>
Date:   2017-08-02T00:41:27Z

    [SPARK-19112][CORE] Support for ZStandard codec

----


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    oh it might actually fall into that as it has the patents:
    https://github.com/facebook/zstd/blob/dev/PATENTS


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Hm, this might be a real error. It seems to be hanging at:
    
    ```
    [info] ExternalAppendOnlyMapSuite:
    ...
    [info] - simple cogroup (56 milliseconds)
    [info] - spilling (3 seconds, 678 milliseconds)
    9/21/17 1:43:14 AM =============================================================
    
    -- Gauges ----------------------------------------------------------------------
    master.aliveWorkers
    9/21/17 2:03:14 AM =============================================================
    
    -- Gauges ----------------------------------------------------------------------
    master.aliveWorkers
    9/21/17 2:23:14 AM =============================================================
    ...
    ```
    
    However I can't figure out how that would be related. This new codec isn't enabled anywhere except in its test. Hm, I can run it again tomorrow in case it's something else transient, but it did fail this way twice and nothing else did.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merging to master. Thanks for seeing this through!


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    > Let me know if we are fine using https://github.com/luben/zstd-jni library which uses BSD license. 
    
    Isn't the license of the native library (not the JNI wrapper) the same or very similar to the RocksDB license that was recently blacklisted by the ASF? That would make it hard to add it to Spark.
    
    (RocksDB has since moved to ASLv2.)


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    The [code](https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/util/Native.java) overwrites the original exception message, which might shed some light on what's going on... and it also ignores some exceptions it shouldn't be ignoring (like errors on close, which may indicate low disk space).
    
    Anyway, let's try again to see if it's at least consistent.
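
    For illustration only (not the actual zstd-jni code): a sketch of the pattern being asked for, which keeps the original failure as the cause when rethrowing and lets close() errors surface instead of swallowing them.
    
    ```scala
    import java.io.{File, FileOutputStream, InputStream}
    
    def loadBundledNative(resource: InputStream): Unit = {
      val tmp = File.createTempFile("libzstd-jni", ".so")
      val out = new FileOutputStream(tmp)
      try {
        val buf = new Array[Byte](8192)
        Iterator.continually(resource.read(buf)).takeWhile(_ != -1)
          .foreach(n => out.write(buf, 0, n))
      } finally {
        out.close() // a failure here (e.g. low disk space) should be visible, not ignored
      }
      try {
        System.load(tmp.getAbsolutePath)
      } catch {
        case e: UnsatisfiedLinkError =>
          // Chain the original error rather than replacing its message.
          throw new ExceptionInInitializerError(e)
      }
    }
    ```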


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82729/
    Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Any benchmark data?



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    It's dual-licensed (see bottom of https://github.com/facebook/zstd/blob/v1.3.1/README.md ) and one is 3-clause BSD. This is fine as it's a "Category A" license: https://www.apache.org/legal/resolved.html  
    
    The zstd-jni binding itself is 2-clause BSD, also OK. https://github.com/luben/zstd-jni/blob/master/LICENSE
    
    Licenses check out and there are no extra dependencies it drags in.
    
    I think this will be OK but we do need to add these two licenses to `licenses/` (see the convention there) and also add a line for each in `LICENSE` here.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Yeah, but that would also cause it to fail locally if it were the cause, and it passes for me. I can't really figure out from the rest of the logs if something obvious is wrong, so I guess the best bet now is to ask for changes in `zstd-jni` so that all errors are properly reported (see https://github.com/apache/spark/pull/18805#issuecomment-335885053).


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130892528
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    +    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    Would it be better to make this a private field so that the property is read only once?
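
    A minimal sketch of that change, assuming the zstd-jni stream classes this patch already uses (`com.github.luben.zstd.ZstdOutputStream` / `ZstdInputStream`):
    
    ```scala
    import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}
    import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
    import org.apache.spark.SparkConf
    
    class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    
      // Read both properties once at construction instead of on every stream creation.
      private val level = conf.getInt("spark.io.compression.zstd.level", 1)
      private val bufferSize =
        conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    
      override def compressedOutputStream(s: OutputStream): OutputStream = {
        new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)
      }
    
      override def compressedInputStream(s: InputStream): InputStream = {
        new BufferedInputStream(new ZstdInputStream(s), bufferSize)
      }
    }
    ```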


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Any idea what is the build failure about?


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80148/
    Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    jenkins retest this please.


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130787287
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
    --- End diff --
    
    done.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80144/testReport)** for PR 18805 at commit [`287a9da`](https://github.com/apache/spark/commit/287a9daa98eee7a130f09724209978fac90baf2d).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec `


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130787269
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    --- End diff --
    
    done.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83204/
    Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    This seems to be caused by an issue in the `zstd-jni` library. It probably uses the wrong `ClassLoader` to load the native library, and as a result it cannot find and load it.
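
    For illustration, a hedged sketch of the kind of lookup that can go wrong (the resource name below is an assumption, not the library's actual layout): if the loader consulted cannot see the zstd-jni jar, the resource comes back null and the later `System.load` fails with `UnsatisfiedLinkError`.
    
    ```scala
    // Use the loader that loaded the wrapper class itself; a hard-coded system or
    // context ClassLoader may not see the jar that bundles the native library.
    val loader = classOf[com.github.luben.zstd.ZstdOutputStream].getClassLoader
    val lib = loader.getResource("libzstd-jni.so") // resource name assumed for this sketch
    require(lib != null, s"native zstd-jni library not visible to $loader")
    ```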


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82911/testReport)** for PR 18805 at commit [`2580633`](https://github.com/apache/spark/commit/25806338814ca5b35f9979c84443a31dab54a133).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82911/testReport)** for PR 18805 at commit [`2580633`](https://github.com/apache/spark/commit/25806338814ca5b35f9979c84443a31dab54a133).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82608 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82608/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r148184755
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    done.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80141 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80141/testReport)** for PR 18805 at commit [`4ee4d2b`](https://github.com/apache/spark/commit/4ee4d2b3fe9e6e295e3896208a8b800eff976dd5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Re: the build failure: you can repro it locally by running `./dev/test-dependencies.sh`. It's failing due to introducing a new dependency... you need to add it to `dev/deps/spark-deps-hadoop-XXX`.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Our compression codec is actually completely decoupled from Hadoop's, but dependency management (and licensing) can be annoying to deal with.



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    retest this please


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82608 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82608/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80142 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80142/testReport)** for PR 18805 at commit [`287a9da`](https://github.com/apache/spark/commit/287a9daa98eee7a130f09724209978fac90baf2d).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec `


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    >> I think this will be OK but we do need to add these two licenses to licenses/ (see the convention there) and also add a line for each in LICENSE here.
    
    @srowen - Does that need to be done with this PR? What are the next steps for this PR?


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    How big is the dependency that's getting pulled in? If we are adding more compression codecs maybe we should retire some old ones, or move them into a separate package so downstream apps can optionally depend on them.



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    @rxin - Updated with benchmark data on our production workload.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    ping.


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r147340655
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    --- End diff --
    
    Should this be `getInt` instead of `getSizeAsBytes`?
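
    For comparison, a sketch of the suggested accessor; the compression level is a plain integer, not a byte size:
    
    ```scala
    // getSizeAsBytes parses size suffixes ("32k", "1m"); a codec level is just an Int.
    val level = conf.getInt("spark.io.compression.zstd.level", 1)
    ```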


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    (I'll file a bug and send a PR for it separately, btw.)


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80148/testReport)** for PR 18805 at commit [`295f38a`](https://github.com/apache/spark/commit/295f38a808dfdbbba94a83a21708b0597327d195).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80148/testReport)** for PR 18805 at commit [`295f38a`](https://github.com/apache/spark/commit/295f38a808dfdbbba94a83a21708b0597327d195).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec `


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80142 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80142/testReport)** for PR 18805 at commit [`287a9da`](https://github.com/apache/spark/commit/287a9daa98eee7a130f09724209978fac90baf2d).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Got it, thanks for the reminder. I think the question is mostly about license and dependency weight then. I think we'd want to use whatever Hadoop provides.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #81967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81967/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80144/
    Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80142/
    Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3930 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3930/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130769858
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
    +    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
    --- End diff --
    
    - Wondering if we should share this config value or have a dedicated one.
    - Do you want to set the default to something higher, like 1 MB or 4 MB? (See the sketch below.)
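
    A sketch of the second option: a dedicated key (the key name is an assumption here) with a larger default.
    
    ```scala
    // Separate zstd buffer size instead of reusing the lz4 block size, defaulting higher.
    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "1m").toInt
    ```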



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #83282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83282/testReport)** for PR 18805 at commit [`95e6b8b`](https://github.com/apache/spark/commit/95e6b8bbda5404ba25f0fe84cdd59562f4f8f481).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82729/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    The patents grant (`PATENTS` file) seems to have been dropped from the repo:
    https://github.com/facebook/zstd
    
    That makes it plain BSD, which is fine for Spark.
    (https://github.com/facebook/zstd/blob/dev/LICENSE)



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Hi, @sitalkedia .
    
    facebook/zstd#775 was `Consider re-licensing to Apache License v2`.
    
    But I found that the new license is not Apache License v2; it adds GPLv2 as a dual license. I'm wondering if GPLv2 is enough for us.
    
    For [Apache licensing](https://www.apache.org/licenses/GPL-compatibility.html), I found the following.
    > Despite our best efforts, the FSF has never considered the Apache License to be compatible with GPL version 2


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82644/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    retest this please


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Yes, the native library has the BSD+patents license:
    https://github.com/luben/zstd-jni/tree/master/src/main/native


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3938 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3938/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    @srowen - Did you get a chance to look into this ?


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    > Why does this need to be in Spark? 
    @srowen you already asked that question and it has been answered on the JIRA as well as the old PR. A user cannot add zstd compression to the internal Spark parts: spark.io.compression.codec. In this particular case he is saying it's the shuffle output where it's making a big difference.
    zstd is already included in other open source projects like Hadoop, but again we don't get that for Spark's internal compression code. zstd itself is BSD licensed. It looks like this PR is using https://github.com/luben/zstd-jni, which also appears to be BSD licensed. We need to decide if it's OK for us to use that directly. Hadoop wrote its own version, but I would say if that version is working we use it. Worst case, if something happens that the maintainer won't fix, we could fork it and be no worse off than having our own copy to start with.



---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    https://www.apache.org/legal/resolved.html
    
    Seems to list bsd-2 and bsd-3 as ok. https://github.com/facebook/zstd/blob/dev/LICENSE  
    Not sure why the Facebook BSD+Patents license was showing up earlier, so we need to make sure this doesn't fall into the "Facebook BSD+Patents license" category.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Yes, the binary distribution is included in the zstd-jni jar file.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Turns out that's caused by SparkContext failing to clean up after itself when the `UnsatisfiedLinkError` happens, so those errors are red herrings...


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82729 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82729/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    It looks like zstd-jni has now been updated to pull 1.3.1


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    @vanzin - Maybe the test timeouts are related to one test failure - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82608/testReport/org.apache.spark.scheduler/ReplayListenerSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
    
    It's failing to load the native library -
    
    ```building from source the jar or providing libzstd-jni in you system.
    	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
    	at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    	at java.lang.System.loadLibrary(System.java:1122)
    	at com.github.luben.zstd.util.Native.load(Native.java:101)
    	at com.github.luben.zstd.ZstdOutputStream.<clinit>(ZstdOutputStream.java:17)
    	at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:244)
    	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$2.apply(EventLoggingListener.scala:122)
    	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$2.apply(EventLoggingListener.scala:122)
    	at scala.Option.map(Option.scala:146)
    	at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:122)
    	at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
    	at org.apache.spark.SparkContext.<init>(SparkContext.scala:127)
    	at ```
     
    Unfortunately, I am not able to repro this issue on my laptop. Any idea what might be wrong?
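    
    A minimal way to probe this on a given machine (assuming zstd-jni's `Native.load()` helper, which is what the stack trace above goes through) would be something like the following hypothetical check - not part of the patch:
    
    ```scala
    // Force the zstd-jni native library to load so any UnsatisfiedLinkError
    // surfaces immediately on the host being tested.
    import com.github.luben.zstd.util.Native
    
    object ZstdNativeCheck {
      def main(args: Array[String]): Unit = {
        Native.load()  // throws if the bundled .so cannot be loaded here
        println("zstd-jni native library loaded successfully")
      }
    }
    ```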




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130769548
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    --- End diff --
    
    would be good to add this link to the spec : http://facebook.github.io/zstd/




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3927 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3927/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #82644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82644/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    https://github.com/facebook/zstd/issues/775




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Same test failed, so looks like there's a real non-infra-related issue...




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Just an update on this - I am in talks with our internal team to relicense the zstd library. This might take some time though. I will keep you updated.
    
    @discipleforteen - Unfortunately, we do not have a setup to run TPC-DS benchmarks. Feel free to run the benchmark on this change and report the numbers if you have the setup readily available.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82608/
    Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80141/testReport)** for PR 18805 at commit [`4ee4d2b`](https://github.com/apache/spark/commit/4ee4d2b3fe9e6e295e3896208a8b800eff976dd5).




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r147215825
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    +    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    Sorry somehow missed these comments. Will address.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    cc @dongjinleekr too.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    cc - @srowen, @tgravescs, @rxin, @sameeragarwal 




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    @rxin - Sure, let me talk to folks internally to see if it is possible to relicense. Otherwise, we might have to upgrade to hadoop 2.9.0, which will come with its own zstd implementation. 




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130787205
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
    +    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
    --- End diff --
    
    You are right, we should not share the config with lz4; I created a new one.
    Let's keep the default at 32kb, which is aligned with the block size used by the other compression codecs.
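    
    For reference, a sketch of how the codec and the new zstd-specific keys would be set from an application - key names follow this patch, values are the proposed defaults, and this is illustrative only:
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Select the zstd codec and set its tunables explicitly.
    val conf = new SparkConf()
      .set("spark.io.compression.codec", "zstd")
      .set("spark.io.compression.zstd.level", "1")        // compression level, 1 = fastest
      .set("spark.io.compression.zstd.bufferSize", "32k") // buffer in front of the JNI stream
    ```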




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by discipleforteen <gi...@git.apache.org>.
Github user discipleforteen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    how about TPCDS ?




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Does the package include a binary distribution for Linux?




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Created https://github.com/luben/zstd-jni/issues/47. 




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #83204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83204/testReport)** for PR 18805 at commit [`eba3024`](https://github.com/apache/spark/commit/eba30249108f195a44dddd42fb8cae35d5f02f5f).




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130787262
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
     
       private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
         (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
    -      || codec.isInstanceOf[LZ4CompressionCodec])
    +      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
       }
     
       private val shortCompressionCodecNames = Map(
         "lz4" -> classOf[LZ4CompressionCodec].getName,
         "lzf" -> classOf[LZFCompressionCodec].getName,
    -    "snappy" -> classOf[SnappyCompressionCodec].getName)
    +    "snappy" -> classOf[SnappyCompressionCodec].getName,
    +    "zstd" -> classOf[SnappyCompressionCodec].getName)
    --- End diff --
    
    Ah, my bad. Fixed it.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    retest this please




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #81911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81911/testReport)** for PR 18805 at commit [`38d4840`](https://github.com/apache/spark/commit/38d484095cb404e3aa2e05633089e087d3a9d2dc).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80140/testReport)** for PR 18805 at commit [`cff558b`](https://github.com/apache/spark/commit/cff558b6873a8ec184159a9df3c1e83c9cd0a6e7).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec `




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Thank you for confirming, @vanzin ! I see.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Please note a few minor improvements I have made compared to the old PR - #17303
    1. Use zstd compression level 1 instead of 3, which is significantly faster.
    2. Wrap the zstd input/output streams in buffered input/output streams to avoid the overhead of excessive JNI calls (see the sketch below).
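    
    A minimal sketch of point 2, assuming the zstd-jni stream classes referenced elsewhere in this thread (illustrative, not the exact patch):
    
    ```scala
    import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}
    import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
    
    object ZstdStreams {
      // Buffer on the Java side so each JNI call sees a full chunk instead of
      // many small writes/reads crossing the native boundary.
      def wrapOutput(s: OutputStream, level: Int = 1, bufferSize: Int = 32 * 1024): OutputStream =
        new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)
    
      def wrapInput(s: InputStream, bufferSize: Int = 32 * 1024): InputStream =
        new BufferedInputStream(new ZstdInputStream(s), bufferSize)
    }
    ```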




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81967/
    Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Thanks for looking into this @srowen. It's weird, I don't understand that either. Also, I am not able to reproduce this issue on my laptop.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r139898945
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    +    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    Agree, it's simpler and cleaner, as it avoids duplicating this property in this file




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r139634333
  
    --- Diff: dev/deps/spark-deps-hadoop-2.6 ---
    @@ -186,3 +186,4 @@ xercesImpl-2.9.1.jar
     xmlenc-0.52.jar
     xz-1.0.jar
     zookeeper-3.4.6.jar
    +zstd-jni-1.3.0-1.jar
    --- End diff --
    
    These need to be updated, it seems




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r147618796
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    --- End diff --
    
    Good eye, fixed. 
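    
    For readers following along, the fix presumably reads the level as a plain integer rather than a byte size, roughly along these lines (sketch only; `conf` is the codec's SparkConf):
    
    ```scala
    // The level is an integer knob; only the buffer size is a byte-size quantity.
    val level = conf.getInt("spark.io.compression.zstd.level", 1)
    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    ```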




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #81967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81967/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    I haven't been able to reproduce the issue locally, but looking at the jenkins logs I see a bunch of exceptions like these:
    
    ```
    17/10/13 06:53:26.609 dispatcher-event-loop-15 ERROR Worker: Failed to launch executor app-20171013030138-0000/3 for Test replay.
    java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
            at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
    ```
    
    And:
    
    ```
    17/10/13 06:53:26.687 pool-1-thread-1-ScalaTest-running-ExternalAppendOnlyMapSuite WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
    org.apache.spark.SparkContext.<init>(SparkContext.scala:127)
    org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$12.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:30
    ```
    
    Note that the first error mentions the app name used by `ReplayListenerSuite` but it actually happens in a completely separate test suite. At the very least, `ReplayListenerSuite` is doing a poor job of cleaning up after itself and we should fix that.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80141/
    Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3930 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3930/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Relevant PR - https://github.com/facebook/zstd/pull/801/files




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Why does this need to be in Spark? and what are the licensing terms of the native code underneath (just suspicious because it's often GPL)? can a user not just add this with their app?
    
    I tend to think we support what Hadoop supports for us here. Doesn't a later Hadoop pull this in?




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130769482
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
     
       private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
         (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
    -      || codec.isInstanceOf[LZ4CompressionCodec])
    +      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
       }
     
       private val shortCompressionCodecNames = Map(
         "lz4" -> classOf[LZ4CompressionCodec].getName,
         "lzf" -> classOf[LZFCompressionCodec].getName,
    -    "snappy" -> classOf[SnappyCompressionCodec].getName)
    +    "snappy" -> classOf[SnappyCompressionCodec].getName,
    +    "zstd" -> classOf[SnappyCompressionCodec].getName)
    --- End diff --
    
    you mean `ZStandardCompressionCodec` ?
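    
    In other words, the short-name map presumably should end up as below (the diff above points "zstd" at the Snappy codec by mistake):
    
    ```scala
    private val shortCompressionCodecNames = Map(
      "lz4" -> classOf[LZ4CompressionCodec].getName,
      "lzf" -> classOf[LZFCompressionCodec].getName,
      "snappy" -> classOf[SnappyCompressionCodec].getName,
      "zstd" -> classOf[ZStandardCompressionCodec].getName)
    ```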




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    ```
    $ ldd linux/amd64/libzstd-jni.so 
    ldd: warning: you do not have execution permission for `linux/amd64/libzstd-jni.so'
    linux/amd64/libzstd-jni.so: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by linux/amd64/libzstd-jni.so)
            linux-vdso.so.1 =>  (0x00007ffe0dfda000)
            libc.so.6 => /lib64/libc.so.6 (0x00007f89eb3a4000)
            /lib64/ld-linux-x86-64.so.2 (0x0000003612e00000)
    ```
    
    Mystery solved; the library is compiled with a newer glibc requirement than the amplab machines have. Can we ask them to tweak their build to support older Linux distros?
    
    ```
    $ cat /etc/issue
    CentOS release 6.9 (Final)
    ```





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    In addition to LICENSE, there is also COPYING in the v1.3.1 release:
    
    https://github.com/facebook/zstd/blob/v1.3.1/LICENSE
    https://github.com/facebook/zstd/blob/v1.3.1/COPYING
    
    I'm not going to pretend to have a fully informed opinion on this issue.
    





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #83282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83282/testReport)** for PR 18805 at commit [`95e6b8b`](https://github.com/apache/spark/commit/95e6b8bbda5404ba25f0fe84cdd59562f4f8f481).




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Yes, licenses have to be updated, that's the one type of thing that's not optional.
    But Marcelo is right that the library actually doesn't yet include the newer dependency with the right license. We can't pull it in until it pulls in 1.3.1.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Updated with zstd-jni version 1.3.1-1 and also updated the LICENSE to include the zstd-jni license. @srowen - How does that look from a licensing perspective?




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r139635152
  
    --- Diff: LICENSE ---
    @@ -269,6 +269,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
          (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE)
          (BSD 3 Clause) DPark (https://github.com/douban/dpark/blob/master/LICENSE)
          (BSD 3 Clause) CloudPickle (https://github.com/cloudpipe/cloudpickle/blob/master/LICENSE)
    +     (BSD 2 Clause) Zstd-jni (https://github.com/luben/zstd-jni/blob/master/LICENSE)
    --- End diff --
    
    ZStandard itself is also included, not just zstd-jni. We'd need an entry for it, and its LICENSE as below.
    https://github.com/luben/zstd-jni/tree/master/src/main/native




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    We have released a new zstd version (https://github.com/facebook/zstd/releases) with a modified BSD + GPLv2 license.
    
    @rxin, @srowen, @markhamstra - Can you confirm this looks fine from licensing perspective and this PR is good to go?




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82911/
    Test PASSed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    @sitalkedia is there any way you can talk to the FB team that owns that one and relicense, similar to RocksDB?
    





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18805




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130889423
  
    --- Diff: docs/configuration.md ---
    @@ -866,7 +866,8 @@ Apart from these, the following properties are also available, and may be useful
         e.g.
         <code>org.apache.spark.io.LZ4CompressionCodec</code>,
         <code>org.apache.spark.io.LZFCompressionCodec</code>,
    -    and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
    +    <code>org.apache.spark.io.SnappyCompressionCodec</code>.
    --- End diff --
    
    nit: '</code>.' -> '</code>,'




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80140/testReport)** for PR 18805 at commit [`cff558b`](https://github.com/apache/spark/commit/cff558b6873a8ec184159a9df3c1e83c9cd0a6e7).




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130889998
  
    --- Diff: docs/configuration.md ---
    @@ -886,6 +887,23 @@ Apart from these, the following properties are also available, and may be useful
       </td>
     </tr>
     <tr>
    +  <td><code>spark.io.compression.zstd.level</code></td>
    +  <td>1</td>
    +  <td>
    +    Compression leve for Zstd compression codec. Increasing the compression level will result in better
    --- End diff --
    
    nit: leve -> level




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #81911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81911/testReport)** for PR 18805 at commit [`38d4840`](https://github.com/apache/spark/commit/38d484095cb404e3aa2e05633089e087d3a9d2dc).




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80140/
    Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81911/
    Test FAILed.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Just to make sure, this is ok if the JNI wrapper does not bring the zstd native library with it; otherwise it also needs to be updated since it still has the old BSD+PATENTS licence (https://github.com/luben/zstd-jni/tree/master/src/main/native), and Spark needs to pull in the version with the correct license.




[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #83204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83204/testReport)** for PR 18805 at commit [`eba3024`](https://github.com/apache/spark/commit/eba30249108f195a44dddd42fb8cae35d5f02f5f).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r147213697
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    // Default compression level for zstd compression to 1 because it is
    +    // fastest of all with reasonably high compression ratio.
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
    +    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    @sitalkedia how about comments like this?
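
    A minimal usage sketch of the two settings discussed in this diff, as they might be set once the codec is selectable — the `zstd` short name is an assumption about how this PR registers the codec, and the defaults are taken from the diff above, not from merged code:

        // Hypothetical sketch: enabling the new codec and its settings via SparkConf.
        // "zstd" as the short codec name is an assumption until the PR merges.
        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .set("spark.io.compression.codec", "zstd")
          .set("spark.io.compression.zstd.level", "1")        // default per the diff above
          .set("spark.io.compression.zstd.bufferSize", "32k") // default per the diff above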


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    I didn't look into it. I can restart the test here, though.


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    >> How big is the dependency that's getting pulled in?
    
    The zstd-jni library is actually very thin and does not pull in any dependencies of its own, so I would not be worried about dependency management issues.
    
    Let me know if we are fine using the https://github.com/luben/zstd-jni library, which uses the BSD license. Otherwise, we might have to implement something of our own.
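
    For reference, pulling the wrapper in is a single build coordinate with no transitive dependencies; the version below is illustrative only, not necessarily the one this PR pins:

        // build.sbt — illustrative coordinate; use whatever version the PR actually pins.
        libraryDependencies += "com.github.luben" % "zstd-jni" % "1.3.2-2"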


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    The good news is that I can reproduce it on the AMPLab machine, so I'll try to play around with the zstd-jni code a bit.


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    In the `Benchmark` section, the values for `Lz4` are all zeros, which is confusing to read: at first I thought they were absolute values, but they are supposed to be relative.


---


[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by tejasapatil <gi...@git.apache.org>.
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r130769646
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  override def compressedOutputStream(s: OutputStream): OutputStream = {
    +    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
    --- End diff --
    
    please add a comment explaining the reason why we chose level 1 over other levels
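
    A sketch of the kind of comment being asked for, sitting on the line from the diff above (`conf` is the codec's SparkConf constructor argument); the rationale is the speed-versus-ratio trade-off described in the PR description, stated here as an illustration rather than the final wording:

        // Default to compression level 1: it is zstd's fastest setting and, per the
        // benchmarks in the PR description, still gives a reasonably high compression
        // ratio; higher levels trade noticeably more CPU for modest extra compression
        // on shuffle and spill data.
        val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt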


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3927 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3927/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83282/
    Test PASSed.


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Old PR - https://github.com/apache/spark/pull/17303


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #80144 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80144/testReport)** for PR 18805 at commit [`287a9da`](https://github.com/apache/spark/commit/287a9daa98eee7a130f09724209978fac90baf2d).


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    **[Test build #3938 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3938/testReport)** for PR 18805 at commit [`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18805#discussion_r147629043
  
    --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
    @@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
         }
       }
     }
    +
    +/**
    + * :: DeveloperApi ::
    + * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
    + * details see - http://facebook.github.io/zstd/
    + *
    + * @note The wire protocol for this codec is not guaranteed to be compatible across versions
    + * of Spark. This is intended for use as an internal compression utility within a single Spark
    + * application.
    + */
    +@DeveloperApi
    +class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
    +
    +  val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
    --- End diff --
    
    This should be private. The intent was to lift both config values out of the method, so `level` can go here too.
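
    Concretely, the shape being suggested is roughly the following — a sketch only, using the names from the diff and zstd-jni's stream classes, as it would sit in CompressionCodec.scala, not the PR's final code:

        import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}
        import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
        import org.apache.spark.SparkConf
        import org.apache.spark.annotation.DeveloperApi

        @DeveloperApi
        class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {

          // Both settings read once, outside the per-stream methods, and kept private.
          private val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
          private val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt

          override def compressedOutputStream(s: OutputStream): OutputStream =
            new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)

          override def compressedInputStream(s: InputStream): InputStream =
            new BufferedInputStream(new ZstdInputStream(s), bufferSize)
        }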



---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18805
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82644/
    Test FAILed.


---
