Posted to issues@spark.apache.org by "t oo (Jira)" <ji...@apache.org> on 2019/12/13 08:02:00 UTC

[jira] [Created] (SPARK-30251) faster way to read csv.gz?

t oo created SPARK-30251:
----------------------------

             Summary: faster way to read csv.gz?
                 Key: SPARK-30251
                 URL: https://issues.apache.org/jira/browse/SPARK-30251
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: t oo


Some data providers deliver files as csv.gz (e.g. 1 GB compressed expanding to 25 GB uncompressed; 5 GB compressed expanding to 130 GB uncompressed; 0.1 GB compressed expanding to 2.5 GB uncompressed). When I tell my boss that the famous big data tool Spark takes 16 hours to convert the 1 GB compressed file to Parquet, I get a look of shock. This is batch data we receive daily: roughly 80 GB compressed (2 TB uncompressed) every day, spread across ~300 files.

I know gzip is not splittable, so each file is currently read by a single worker, but we don't have the space or patience to pre-convert the data to bz2 or to uncompressed text. Could Spark provide a better codec? I have seen posts suggesting that even plain Python is faster than Spark for this.
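For context, here is a minimal sketch of the kind of job described above (paths, schema handling and the repartition count are made up for illustration). Because gzip is not splittable, the read runs as a single task; repartitioning after the read at least spreads the Parquet encoding across the cluster, though the decompression itself still happens on one executor:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-gz-to-parquet")
  .getOrCreate()

// Hypothetical input path; a .csv.gz file comes back as a single partition
// because the standard gzip codec cannot be split.
val df = spark.read
  .option("header", "true")
  .csv("s3://bucket/provider/data-2019-12-13.csv.gz")

// Redistribute the rows so the CPU-heavy Parquet encoding can use the whole
// cluster, even though the decompression stayed on one task.
df.repartition(200)
  .write
  .mode("overwrite")
  .parquet("s3://bucket/provider/parquet/2019-12-13")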

 

[https://stackoverflow.com/questions/40492967/dealing-with-a-large-gzipped-file-in-spark]

[https://github.com/nielsbasjes/splittablegzip]
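As a sketch of how the codec linked above might be wired into Spark (untested here; the artifact version and configuration details are assumptions based on that project's README, and the paths are hypothetical):

// Submit with the splittablegzip artifact on the classpath, e.g.
//   spark-submit --packages nl.basjes.hadoop:splittablegzip:1.2 ...
// then register the codec so the Hadoop input format treats .gz files as splittable.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("splittable-gz-read")
  .config("spark.hadoop.io.compression.codecs",
          "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  .getOrCreate()

// With the codec registered, one large .csv.gz can be read as many splits;
// each split re-reads and discards the bytes before its start offset,
// trading extra I/O for parallel decompression.
val df = spark.read
  .option("header", "true")
  .csv("s3://bucket/provider/big-file.csv.gz")

df.write.mode("overwrite").parquet("s3://bucket/provider/parquet-out")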


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org