Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/02/07 03:22:00 UTC

[jira] [Commented] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

    [ https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354903#comment-16354903 ] 

Sean Owen commented on SPARK-23347:
-----------------------------------

GZIPOutputStream is already buffered. As you say, it implements the bulk write operation rather than the single-byte write, and that is fine: the opposite case, where only single-byte writes are available, is the one that hurts performance. It is even less of a concern when the underlying output is itself already buffered. I think this should be closed as a mistake.

> Introduce buffer between Java data stream and gzip stream
> ---------------------------------------------------------
>
>                 Key: SPARK-23347
>                 URL: https://issues.apache.org/jira/browse/SPARK-23347
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Ted Yu
>            Priority: Minor
>
> Currently GZIPOutputStream is wrapped directly around ByteArrayOutputStream,
> e.g. in KVStoreSerializer:
> {code}
>       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>       GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not itself implement the single-byte write(int) method. It only provides a write(byte[], offset, len) method, which calls the corresponding JNI zlib function.
> A BufferedOutputStream could be introduced around the GZIPOutputStream for better performance.
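
For reference, the two arrangements under discussion can be sketched as below. This is only an illustration, not code from KVStoreSerializer: the class name GzipBufferSketch and its helper methods are hypothetical, and the per-byte write loop is an assumed worst-case caller, not how Spark actually writes to the stream. Both variants should decompress back to the same bytes; whether the extra BufferedOutputStream helps at all depends on whether the caller really issues single-byte writes.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Spark.
class GzipBufferSketch {

    // GZIPOutputStream wrapped directly around ByteArrayOutputStream,
    // as in the snippet quoted in the issue.
    static byte[] gzipDirect(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bytes)) {
            for (byte b : data) {
                // Single-byte write(int); GZIPOutputStream inherits it
                // from DeflaterOutputStream.
                out.write(b);
            }
        }
        return bytes.toByteArray();
    }

    // Same data, with a BufferedOutputStream placed in front of the
    // gzip stream so single-byte writes are batched into bulk writes.
    static byte[] gzipBuffered(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out =
                 new BufferedOutputStream(new GZIPOutputStream(bytes))) {
            for (byte b : data) {
                out.write(b);
            }
        } // close() flushes the buffer and finishes the gzip stream
        return bytes.toByteArray();
    }

    // Decompress, used here only to check that both variants round-trip.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                result.write(chunk, 0, n);
            }
        }
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some serialized value".getBytes(StandardCharsets.UTF_8);
        boolean directOk = Arrays.equals(gunzip(gzipDirect(data)), data);
        boolean bufferedOk = Arrays.equals(gunzip(gzipBuffered(data)), data);
        System.out.println("direct round-trip ok: " + directOk);
        System.out.println("buffered round-trip ok: " + bufferedOk);
    }
}
```

If the caller already hands the stream whole byte arrays through write(byte[], int, int), the two variants do essentially the same work and the extra buffer only adds a copy.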


