Posted to user@spark.apache.org by Sai Prasanna <an...@gmail.com> on 2014/04/28 08:41:47 UTC

Spark with Parquet

Hi All,

I want to store a csv-text file in Parquet format in HDFS and then do some
processing in Spark.

My search for a way to do this came up empty; most of what I found covered
Parquet with Impala.

Any guidance here? Thanks !!

Re: Spark with Parquet

Posted by Mohit Jaggi <mo...@gmail.com>.
something like this should work:

val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the inferred schema is not accurate
df.write.parquet("myfile.parquet")
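A slightly fuller sketch of the same round trip, assuming a SparkSession named `spark`, a CSV file with a header row, and placeholder HDFS paths (this uses the Spark 2.x DataFrame API, which postdates the original 2014 question):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and paths; adjust for your cluster.
val spark = SparkSession.builder().appName("CsvToParquet").getOrCreate()

// Read the CSV; header and schema inference are optional conveniences,
// and an explicit schema is safer if inference guesses wrong.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/myfile.csv")

// Write the data back out as Parquet on HDFS.
df.write.mode("overwrite").parquet("hdfs:///data/myfile.parquet")

// Later processing can read the Parquet files directly.
val parquetDF = spark.read.parquet("hdfs:///data/myfile.parquet")
parquetDF.printSchema()
```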


Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com






Re: Spark with Parquet

Posted by Matei Zaharia <ma...@gmail.com>.
Spark uses the Hadoop InputFormat and OutputFormat classes, so you can simply create a JobConf to read the data and pass that to SparkContext.hadoopFile. There are some examples for Parquet usage here: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ and here: http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark.
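A rough sketch of that InputFormat route, in the style of the linked posts (package names and the read-support class depend on your parquet-mr and Avro versions, so treat every class name here as an assumption rather than a verified API):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import parquet.avro.AvroReadSupport      // assumption: pre-Apache parquet-mr packages
import parquet.hadoop.ParquetInputFormat // assumption: pre-Apache parquet-mr packages

// Hypothetical setup: sc is an existing SparkContext.
val sc: SparkContext = ???
val job = new Job()

// Tell the InputFormat how to materialize records (Avro, as in the linked posts).
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

// Parquet's Hadoop InputFormat yields (Void, record) pairs; keep only the records.
val records = sc.newAPIHadoopFile(
    "hdfs:///data/myfile.parquet",
    classOf[ParquetInputFormat[GenericRecord]],
    classOf[Void],
    classOf[GenericRecord],
    job.getConfiguration)
  .map(_._2)
```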

Matei



Re: Spark with Parquet

Posted by shamu <pr...@hotmail.com>.
Create a Hive table x.
Load your CSV data into table x:
LOAD DATA INPATH 'file/path' INTO TABLE x;

Create a Hive table y with the same structure as x, except add STORED AS PARQUET:
INSERT OVERWRITE TABLE y SELECT * FROM x;

This will leave Parquet files under /user/hive/warehouse/y (as an
example); you can use that path for your processing...
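Those Hive steps can also be driven from Spark itself with a Hive-enabled session; the table names and the two-column schema below are placeholders, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvToParquetViaHive")
  .enableHiveSupport() // requires Spark configured with Hive support
  .getOrCreate()

// Staging table backed by the raw CSV (hypothetical two-column schema).
spark.sql("""CREATE TABLE IF NOT EXISTS x (id INT, name STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")
spark.sql("LOAD DATA INPATH 'file/path' INTO TABLE x")

// Same structure as x, but stored as Parquet.
spark.sql("CREATE TABLE IF NOT EXISTS y (id INT, name STRING) STORED AS PARQUET")
spark.sql("INSERT OVERWRITE TABLE y SELECT * FROM x")

// The Parquet files now live under the warehouse dir, e.g. /user/hive/warehouse/y.
```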




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-Parquet-tp4923p27584.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org