Posted to user@spark.apache.org by "Devi P.V" <de...@gmail.com> on 2016/11/16 07:05:47 UTC

what is the optimized way to combine multiple dataframes into one dataframe ?

Hi all,

I have 4 data frames with three columns,

client_id,product_id,interest

I want to combine these 4 dataframes into one dataframe. I used union like
the following:

df1.union(df2).union(df3).union(df4)

But it is time-consuming for big data. What is the optimized way to do
this using Spark 2.0 & Scala?


Thanks

Re: what is the optimized way to combine multiple dataframes into one dataframe ?

Posted by Deepak Sharma <de...@gmail.com>.
Can you try caching the individual dataframes and then union them?
It may save you time.
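A minimal sketch of that suggestion, assuming df1 through df4 are existing DataFrames sharing the (client_id, product_id, interest) schema from the question:

```scala
// Cache each input, then fold them together with union.
val cached = Seq(df1, df2, df3, df4).map(_.cache())

// Note: union is lazy, and caching only pays off if each input
// DataFrame is reused elsewhere; the union itself still has to
// scan every partition of every input.
val combined = cached.reduce(_ union _)
combined.count() // an action forces materialization
```

The reduce form is equivalent to the chained df1.union(df2).union(df3).union(df4) from the question, just easier to extend to more inputs.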

Thanks
Deepak

On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V <de...@gmail.com> wrote:

> Hi all,
>
> I have 4 data frames with three columns,
>
> client_id,product_id,interest
>
> I want to combine these 4 dataframes into one dataframe. I used union like
> the following:
>
> df1.union(df2).union(df3).union(df4)
>
> But it is time-consuming for big data. What is the optimized way to do
> this using Spark 2.0 & Scala?
>
>
> Thanks
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

RE: what is the optimized way to combine multiple dataframes into one dataframe ?

Posted by Shreya Agarwal <sh...@microsoft.com>.
If you are reading all these datasets from files in persistent storage, functions like sc.textFile can take folders/patterns as input and read all of the matching files into the same RDD. Then you can convert it to a dataframe.
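A sketch of that approach using the Spark 2.0 DataFrame reader instead of sc.textFile (the paths and the CSV format are assumptions for illustration; the reader accepts multiple paths or a glob pattern directly, so no explicit union is needed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("combine").getOrCreate()

// Hypothetical input locations; all matching files are read into
// one DataFrame in a single pass.
val df = spark.read
  .option("header", "true")
  .csv("/data/interests/part1", "/data/interests/part2") // or .csv("/data/interests/*")

df.printSchema()
```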

When you say it is time-consuming with union, how are you measuring that? Did you try having all of them in one DF in comparison to having them broken down? Are you seeing a non-linear slowdown in operations after union with linear increase in data size?
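One way to make that measurement concrete is a small timing helper; a sketch (the helper and the plain-Scala usage below are illustrative, not from the thread):

```scala
// Hypothetical helper that times a block of code. In Spark, union is
// lazy, so the cost shows up in the action that follows it (count,
// write, ...); time those actions, not the union call itself.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

// Plain-Scala usage; with the DataFrames from the question you would
// wrap an action instead, e.g. time("union count") { combined.count() }.
val total = time("sum") { (1 to 1000).sum }
```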
Sent from my Windows 10 phone

From: Devi P.V<ma...@gmail.com>
Sent: Tuesday, November 15, 2016 11:06 PM
To: user @spark<ma...@spark.apache.org>
Subject: what is the optimized way to combine multiple dataframes into one dataframe ?

Hi all,

I have 4 data frames with three columns,

client_id,product_id,interest

I want to combine these 4 dataframes into one dataframe. I used union like the following:

df1.union(df2).union(df3).union(df4)

But it is time-consuming for big data. What is the optimized way to do this using Spark 2.0 & Scala?


Thanks