Posted to issues@spark.apache.org by "Nicholas Chammas (Jira)" <ji...@apache.org> on 2019/09/28 05:33:00 UTC
[jira] [Commented] (SPARK-29280) DataFrameReader should support a compression option
[ https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939864#comment-16939864 ]
Nicholas Chammas commented on SPARK-29280:
------------------------------------------
cc [~hyukjin.kwon], [~cloud_fan]
> DataFrameReader should support a compression option
> ---------------------------------------------------
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.4.4
> Reporter: Nicholas Chammas
> Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter] supports a {{compression}} option, but [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] doesn't. The lack of a {{compression}} option in the reader causes some friction in the following cases:
> # You want to read some data compressed with a codec that Spark does not [load by default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
> # You want to read some data with a codec that overrides one of the built-in codecs that Spark supports.
> # You want to explicitly instruct Spark on what codec to use on read when it will not be able to correctly auto-detect it (e.g. because the file extension is [missing|https://stackoverflow.com/q/52011697/877069], [non-standard|https://stackoverflow.com/q/44372995/877069], or [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you load a single gzipped file using multiple concurrent tasks. (You can see the details of how it works and why it's useful in the project README and in SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}. I believe this is a Hadoop configuration property rather than a Spark one, since it [doesn't appear to be documented by Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, there is also a Spark setting with a similar name, {{spark.io.compression.codec}}, which serves a different purpose (compressing Spark's own internal data, such as shuffle output).
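> For reference, the workaround for case #2 today looks roughly like this. The {{spark.hadoop.}} prefix is Spark's documented mechanism for forwarding arbitrary properties into the Hadoop {{Configuration}}; the codec class name comes from the SplittableGzip README, and the codec's jar must already be on the classpath (exact coordinates omitted here):
> {code:none}
> spark-submit \
>   --conf spark.hadoop.io.compression.codecs=nl.basjes.hadoop.io.compress.SplittableGzipCodec \
>   my_job.py
> {code}
> This works, but it buries what is logically a per-read option in global job configuration, which is exactly the friction a reader-side {{compression}} option would remove.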
> It would be much clearer for the user and more consistent with the writer interface if the reader let you directly specify the codec.
> For example:
> {code:python}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org