Posted to user@spark.apache.org by janardhan shetty <ja...@gmail.com> on 2016/07/25 00:34:25 UTC

Bzip2 to Parquet format

We have data in Bzip2 compression format. Are there any links or resources on
converting it to Parquet in Spark, along with performance benchmarks and
use-case study materials?

Re: Bzip2 to Parquet format

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

This is the expected behaviour.
The default compression for Parquet is `snappy`.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L215
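If you want a different codec, it can be overridden either globally or per write. A sketch, assuming a Spark 2.x `SparkSession` named `spark` and the `inputDF` DataFrame from the snippet quoted below (the paths are made up):

```scala
// Override the default Parquet codec globally via SQLConf...
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// ...or per write via the DataFrameWriter option.
// Accepted values include "snappy", "gzip", "lzo", and "uncompressed".
inputDF.write
  .option("compression", "gzip")
  .parquet("result-gzip.parquet")  // part files end with .gz.parquet
```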

// maropu

On Tue, Jul 26, 2016 at 6:33 AM, janardhan shetty <ja...@gmail.com>
wrote:

> Andrew,
>
> 2.0
>
> I tried
> val inputR = sc.textFile(file)
> val inputS = inputR.map(x => x.split("`"))
> val inputDF = inputS.toDF()
>
> inputDF.write.format("parquet").save("result.parquet")
>
> Result part files end with *.snappy.parquet*. Is that expected?
>
> On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <an...@aehrlich.com>
> wrote:
>
>> You can load the text with sc.textFile() to an RDD[String], then use
>> .map() to convert it into an RDD[Row]. At this point you are ready to
>> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType)
>>
>> Here is an example of how to define the StructType (schema) that you
>> will combine with the RDD[Row] to create a DataFrame.
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>>
>> Once you have the DataFrame, save it to Parquet with
>> dataframe.write.parquet("/path") to create Parquet files.
>>
>> Reference for SQLContext / createDataFrame:
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>>
>>
>>
>> On Jul 24, 2016, at 5:34 PM, janardhan shetty <ja...@gmail.com>
>> wrote:
>>
>> We have data in Bzip2 compression format. Are there any links or resources
>> on converting it to Parquet in Spark, along with performance benchmarks and
>> use-case study materials?
>>
>>
>>
>


-- 
---
Takeshi Yamamuro

Re: Bzip2 to Parquet format

Posted by janardhan shetty <ja...@gmail.com>.
Andrew,

2.0

I tried
val inputR = sc.textFile(file)
val inputS = inputR.map(x => x.split("`"))
val inputDF = inputS.toDF()

inputDF.write.format("parquet").save("result.parquet")

Result part files end with *.snappy.parquet*. Is that expected?

On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich <an...@aehrlich.com> wrote:

> You can load the text with sc.textFile() to an RDD[String], then use
> .map() to convert it into an RDD[Row]. At this point you are ready to
> apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType)
>
> Here is an example of how to define the StructType (schema) that you will
> combine with the RDD[Row] to create a DataFrame.
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
>
> Once you have the DataFrame, save it to Parquet with
> dataframe.write.parquet("/path") to create Parquet files.
>
> Reference for SQLContext / createDataFrame:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
>
>
>
> On Jul 24, 2016, at 5:34 PM, janardhan shetty <ja...@gmail.com>
> wrote:
>
> We have data in Bzip2 compression format. Are there any links or resources on
> converting it to Parquet in Spark, along with performance benchmarks and
> use-case study materials?
>
>
>

Re: Bzip2 to Parquet format

Posted by Andrew Ehrlich <an...@aehrlich.com>.
You can load the text with sc.textFile() to an RDD[String], then use .map() to convert it into an RDD[Row]. At this point you are ready to apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType)

Here is an example of how to define the StructType (schema) that you will combine with the RDD[Row] to create a DataFrame.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType

Once you have the DataFrame, save it to Parquet with dataframe.write.parquet("/path") to create Parquet files.

Reference for SQLContext / createDataFrame: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
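The steps above can be sketched end to end. This is illustrative only: the input path, the backtick delimiter, and the column names are assumptions (taken from the rest of the thread or made up), and a Spark 2.x `SparkSession` named `spark` is assumed:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Bzip2 is splittable; sc.textFile decompresses .bz2 input transparently.
val lines = spark.sparkContext.textFile("input/data.bz2")

// Convert each delimited line into a Row.
val rows = lines.map { line =>
  val fields = line.split("`")
  Row(fields: _*)
}

// Schema to combine with the RDD[Row]; column names are hypothetical and
// the number of fields per line must match the number of StructFields.
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

val df = spark.createDataFrame(rows, schema)

// Write Parquet; with default settings the part files are snappy-compressed,
// which is why they end in .snappy.parquet.
df.write.parquet("output/data.parquet")
```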



> On Jul 24, 2016, at 5:34 PM, janardhan shetty <ja...@gmail.com> wrote:
> 
> We have data in Bzip2 compression format. Are there any links or resources on converting it to Parquet in Spark, along with performance benchmarks and use-case study materials?