Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/09/16 20:38:46 UTC

[jira] [Resolved] (SPARK-2496) Compression streams should write their codec info to the stream

     [ https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-2496.
-------------------------------
    Resolution: Incomplete

Resolving as "Incomplete"; if we still want to do this then we should wait until we have a specific concrete use-case / list of things that need to be changed.

> Compression streams should write their codec info to the stream
> -------------------------------------------------------------
>
>                 Key: SPARK-2496
>                 URL: https://issues.apache.org/jira/browse/SPARK-2496
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Spark sometimes stores compressed data outside of Spark (e.g. event logs, blocks in Tachyon), and that data is read back directly using the codec configured by the user. When the configured codec differs between runs, Spark would not be able to read the data back.
> I'm not sure yet what the best strategy is here. If we write the codec identifier for all streams, then we will be writing a lot of identifiers for shuffle blocks. One possibility is to write it only for blocks that will be shared across different Spark instances (i.e. managed outside of Spark), which includes Tachyon blocks and event log blocks.
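The idea described above can be sketched roughly as follows. This is a minimal illustration, not Spark's actual CompressionCodec API: the codec IDs, class name, and method names are all hypothetical, and java.util.zip GZIP stands in for Spark's configurable codecs (LZF, Snappy, etc.). The writer prefixes the stream with a one-byte codec identifier; the reader inspects that byte and chooses the decompressor accordingly, independent of whatever codec is configured at read time.

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical sketch of a self-describing compressed stream.
// Not Spark's real API; GZIP is a stand-in for Spark's codecs.
public class CodecHeaderDemo {
    static final byte PLAIN = 0;  // no compression
    static final byte GZIP  = 1;  // java.util.zip GZIP

    // Write a one-byte codec identifier, then the (possibly compressed) payload.
    static byte[] write(byte codec, byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(codec);  // identifier is written uncompressed, before the data
        OutputStream out = (codec == GZIP) ? new GZIPOutputStream(bos) : bos;
        out.write(payload);
        out.close();
        return bos.toByteArray();
    }

    // Read the identifier first, then decode the rest with the matching codec.
    static byte[] read(byte[] stored) throws IOException {
        ByteArrayInputStream bis = new ByteArrayInputStream(stored);
        int codec = bis.read();  // first byte selects the decompressor
        InputStream in = (codec == GZIP) ? new GZIPInputStream(bis) : bis;
        return in.readAllBytes();
    }
}
```

With this scheme a reader never consults the user-configured codec for externally managed blocks, which addresses the cross-run mismatch at the cost of one extra byte per stream; that cost is why the description suggests applying it only to blocks shared outside Spark rather than to every shuffle block.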



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org