Posted to user@spark.apache.org by Daniel Haviv <da...@gmail.com> on 2014/11/19 09:41:56 UTC

Merging Parquet Files

Hello,
I'm writing a process that ingests JSON files and saves them as Parquet files.
The process works like this:

// Load the newly ingested JSON files and the existing Parquet data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")

jsonRequests.registerTempTable("jsonRequests")
parquetRequests.registerTempTable("parquetRequests")

// Combine the new and existing records into a single SchemaRDD
val unified_requests = sqlContext.sql(
  "select * from jsonRequests union select * from parquetRequests")

// Write the combined data out as Parquet
unified_requests.saveAsParquetFile("/tempdir")

and then I delete /requests_parquet and rename /tempdir to /requests_parquet.
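For reference, that last step looks roughly like this with the Hadoop
FileSystem API (a sketch; paths as above):

import org.apache.hadoop.fs.{FileSystem, Path}

// Replace the old Parquet directory with the freshly written one
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/requests_parquet"), true) // recursive delete
fs.rename(new Path("/tempdir"), new Path("/requests_parquet"))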

Is there a better way to achieve that?

Another problem I have is that I get a lot of small JSON files, and as a
result a lot of small Parquet files. I'd like to merge the JSON files into
a few larger Parquet files. How do I do that?

Thank you,
Daniel

Re: Merging Parquet Files

Posted by Michael Armbrust <mi...@databricks.com>.
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv <da...@gmail.com>
wrote:
>
> Another problem I have is that I get a lot of small JSON files, and as a
> result a lot of small Parquet files. I'd like to merge the JSON files into
> a few larger Parquet files. How do I do that?
>

You can use `coalesce` on any RDD to reduce the number of partitions; since
each partition is written as one file, that merges the output into fewer,
larger files.
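For example (a sketch reusing the names from the original post; the target
of 4 partitions is an arbitrary choice, and this assumes a Spark 1.x
SchemaRDD, whose coalesce preserves the schema):

// Shrink to a few partitions before writing; each partition becomes
// one output file, so this produces a few larger Parquet files.
val merged = unified_requests.coalesce(4)
merged.saveAsParquetFile("/tempdir")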

Re: Merging Parquet Files

Posted by Daniel Haviv <da...@gmail.com>.
Very cool, thank you!


On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier <mp...@gmail.com> wrote:

> You can also insert into existing tables via .insertInto(tableName,
> overwrite). You just have to import sqlContext._

Re: Merging Parquet Files

Posted by Marius Soutier <mp...@gmail.com>.
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._
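A minimal sketch of that approach (the table name "requests_parquet" is
hypothetical; this assumes the table already exists and is known to the
context, and that overwrite = false appends while true replaces the
table's contents):

import sqlContext._ // brings the SQLContext implicits into scope

// Append the newly ingested JSON records to the existing table
val newRequests = sqlContext.jsonFile("/requests")
newRequests.insertInto("requests_parquet", overwrite = false)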
