Posted to user@spark.apache.org by Mohit Durgapal <du...@gmail.com> on 2020/05/18 16:57:18 UTC

How to split a dataframe into two dataframes based on count

Dear All,

I would like to know how, in Spark 2.0, I can split a dataframe into two
dataframes when I know the exact counts the two dataframes should have. I
tried using limit but got quite weird results. Also, I am looking for exact
counts in the child dfs, not an approximate %-based split.

*Following is what I have tried:*

var dfParent = spark.read.parquet("somelocation"); // let's say it has 4000 rows

I want to split the parent into two dfs with the following counts:

var dfChild1Count = 1000

var dfChild2Count = 3000

*I tried this:*

var dfChild1 = dfParent.limit(dfChild1Count);

var dfChild2 = dfParent.except(dfChild1);

*and wrote them to output HDFS directories:*

dfChild1.write.parquet("/outputfilechild1");

dfChild2.write.parquet("/outputfilechild2");

It turns out this results in some rows being duplicated across the output
files outputfilechild1 & outputfilechild2. Could anyone explain why they
have duplicates?

When I sorted my parent dataframe before the limit, it worked fine:

dfParent = dfParent.orderBy(col("unique_col").desc)

Seems like the limit on the parent is executed twice and returns different
records each time. Not sure why it is executed twice when I specified it
only once.

Also, is there a better way to split a df into multiple dfs when we know
the exact counts of the child dfs?




Regards,
Mohit

Re: How to split a dataframe into two dataframes based on count

Posted by Vipul Rajan <vi...@gmail.com>.
Hi Mohit,

"Seems like the limit on parent is executed twice and return different
records each time. Not sure why it is executed twice when I mentioned only
once"

That is to be expected. Spark follows lazy evaluation, which means
execution only happens when you call an action, so every action re-runs
every step of the plan (mostly; some steps are reused and skipped). Because
limit without an ordering is non-deterministic, each action can pick a
different set of rows, which is why some rows end up in both output files.
You can try this:

var dfChild1 = dfParent.limit(dfChild1Count).cache()
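
Note that cache() is itself lazy; the data is only pinned once an action
runs against it. A minimal sketch of forcing that before the writes
(count() is just one convenient action for this):

dfChild1.count(); // materializes the cache so the write and except() see the same rows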

When you sort your dataframe, this problem does not occur: ordering by a
unique column makes the result of limit deterministic, and the shuffle that
sorting triggers writes its output to disk, which Spark reuses across jobs
instead of recomputing it.
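
If you want exact counts without relying on caching, another option is to
number the rows and filter on the row number. A rough sketch (untested; it
assumes the unique_col from your mail, and note that a Window with no
partitionBy pulls all rows into a single partition, which is fine for
moderate data sizes):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy(col("unique_col"))
val numbered = dfParent.withColumn("rn", row_number().over(w))
val dfChild1 = numbered.filter(col("rn") <= dfChild1Count).drop("rn")
val dfChild2 = numbered.filter(col("rn") > dfChild1Count).drop("rn")

Because the row numbers come from a deterministic ordering, the two
children are disjoint and have exactly the counts you asked for, with no
need for except().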

Let me know if you get it to work.

Regards
