Posted to user@spark.apache.org by Benjamin Kim <bb...@gmail.com> on 2017/02/13 17:48:18 UTC

Parquet Gzipped Files

We are receiving files from an outside vendor who creates Parquet data files and gzips them before delivery. Does anyone know how to gunzip the files in Spark and load the Parquet data into a DataFrame? I thought sc.textFile or sc.wholeTextFiles would gunzip the files automatically, but I'm getting a decompression header error when trying to open the Parquet file.
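
A minimal sketch of the mismatch, with made-up paths: sc.textFile does gunzip .gz input transparently, but it then splits the decompressed bytes into text lines, while the Parquet reader expects an uncompressed .parquet file it can seek within.

  // Paths here are placeholders, not the real vendor layout.
  // sc.textFile decompresses .gz transparently, but treats the result as
  // newline-delimited text, which mangles the binary Parquet layout.
  val lines = sc.textFile("hdfs:///vendor/report.parquet.gz")

  // The Parquet reader seeks to the footer of a raw .parquet file, so it
  // only works once the outer gzip layer has been stripped off.
  val df = spark.read.parquet("hdfs:///vendor/report.parquet")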

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Parquet Gzipped Files

Posted by Benjamin Kim <bb...@gmail.com>.
Jörn,

I agree with you, but the vendor is a little difficult to work with. For now, I will try decompressing the files from S3 and saving them uncompressed in HDFS. If someone already has an example of this, please let me know.
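
One possible shape for that, sketched with sc.binaryFiles plus the Hadoop FileSystem API; the bucket, paths, and output layout are placeholders, and error handling is omitted:

  import java.net.URI
  import java.util.zip.GZIPInputStream
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.IOUtils

  // Read each gzipped Parquet file as a whole binary stream (no line splitting).
  val gzFiles = sc.binaryFiles("s3a://vendor-bucket/incoming/*.parquet.gz")

  gzFiles.foreach { case (srcPath, stream) =>
    // Strip the outer gzip layer and copy the raw Parquet bytes into HDFS.
    val in = new GZIPInputStream(stream.open())
    val fileName = srcPath.split("/").last.stripSuffix(".gz")
    val outPath = new Path(s"hdfs:///data/vendor/$fileName")
    val fs = FileSystem.get(URI.create(outPath.toString), new Configuration())
    val out = fs.create(outPath, true)
    IOUtils.copyBytes(in, out, 4096, true)  // closes both streams
  }

  // Once the plain .parquet files are in HDFS, the normal reader works.
  val df = spark.read.parquet("hdfs:///data/vendor/")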

Cheers,
Ben


> On Feb 13, 2017, at 9:50 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> Your vendor should use Parquet's internal compression rather than gzipping an already-written Parquet file.
> 
>> On 13 Feb 2017, at 18:48, Benjamin Kim <bb...@gmail.com> wrote:
>> 
>> We are receiving files from an outside vendor who creates Parquet data files and gzips them before delivery. Does anyone know how to gunzip the files in Spark and load the Parquet data into a DataFrame? I thought sc.textFile or sc.wholeTextFiles would gunzip the files automatically, but I'm getting a decompression header error when trying to open the Parquet file.
>> 
>> Thanks,
>> Ben
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Parquet Gzipped Files

Posted by Jörn Franke <jo...@gmail.com>.
Your vendor should use Parquet's internal compression rather than gzipping an already-written Parquet file.
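
If the vendor produces the files with Spark (or any Parquet-aware writer), the gzip codec can be applied inside the Parquet file itself; a sketch, assuming a DataFrame named df and a placeholder output path:

  // Compress the Parquet pages internally instead of gzipping the whole file.
  df.write
    .option("compression", "gzip")
    .parquet("s3a://vendor-bucket/outgoing/")

  // Or set the codec session-wide:
  spark.conf.set("spark.sql.parquet.compression.codec", "gzip")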

> On 13 Feb 2017, at 18:48, Benjamin Kim <bb...@gmail.com> wrote:
> 
> We are receiving files from an outside vendor who creates Parquet data files and gzips them before delivery. Does anyone know how to gunzip the files in Spark and load the Parquet data into a DataFrame? I thought sc.textFile or sc.wholeTextFiles would gunzip the files automatically, but I'm getting a decompression header error when trying to open the Parquet file.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org