Posted to user@spark.apache.org by AS...@nz.imshealth.com on 2017/03/07 03:52:06 UTC

Check if dataframe is empty

Hello!

I am fairly sure this has been asked many times before, but I cannot find the question in the mailing list archive.

I need to check whether a dataframe is empty. I receive the dataframe from a third-party library; it may be empty, but it may also be huge, with millions of rows. I want to skip some logic when the dataframe is empty. How can I check this efficiently?

Right now I am doing it in the following way:

private def isEmpty(df: Option[DataFrame]): Boolean = {
  df.isEmpty || (df.isDefined && df.get.limit(1).rdd.isEmpty())
}

But this check is really slow for big dataframes. I would be grateful for any suggestions.

Thank you in advance.


Best regards,

Artem

________________________________
********************** IMPORTANT--PLEASE READ ************************ This electronic message, including its attachments, is CONFIDENTIAL and may contain PROPRIETARY or LEGALLY PRIVILEGED or PROTECTED information and is intended for the authorized recipient of the sender. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying, or distribution of this message or any of the information included in it is unauthorized and strictly prohibited. If you have received this message in error, please immediately notify the sender by reply e-mail and permanently delete this message and its attachments, along with any copies thereof, from all locations received (e.g., computer, mobile device, etc.). Thank you. ************************************************************************

RE: Check if dataframe is empty

Posted by AS...@nz.imshealth.com.
Thank you for the prompt response. But why is it faster? There is an implementation of isEmpty for rdd:

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}


So it is basically the same take(1). Is the difference because of limit?


Regards,

Artem Shaitarov

From: jasbir.sing@accenture.com [mailto:jasbir.sing@accenture.com]
Sent: Tuesday, 7 March 2017 5:04 p.m.
To: Artem Shaitarov <AS...@nz.imshealth.com>; user@spark.apache.org
Subject: RE: Check if dataframe is empty

Dataframe.take(1) is faster.
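
One way to look into the question of why the two checks differ is to print the physical plan Spark builds for the limit(1) variant and to time both checks on your own data. The sketch below assumes a local SparkSession and uses spark.range as a stand-in for the third-party dataframe; the object name CompareEmptyChecks, the time helper and the row count are illustrative only and do not come from the thread. Timings will vary with the Spark version and with whatever plan sits upstream of the dataframe.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CompareEmptyChecks {
  // Hypothetical timing helper: runs the block once and returns (result, elapsed ms).
  def time[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-empty-checks")
      .getOrCreate()

    // Stand-in for the DataFrame received from the third-party library (assumption).
    val df: DataFrame = spark.range(0L, 5000000L).toDF("id")

    // Print the physical plan behind the limit(1) variant for inspection.
    df.limit(1).explain()

    // The check from the original post versus the one suggested in the reply.
    val (viaRdd, rddMs) = time(df.limit(1).rdd.isEmpty())
    val (viaTake, takeMs) = time(df.take(1).isEmpty)

    println(s"limit(1).rdd.isEmpty = $viaRdd in $rddMs ms")
    println(s"take(1).isEmpty      = $viaTake in $takeMs ms")

    spark.stop()
  }
}
```

Both expressions should agree on the result; only the work Spark does to get there differs, which is what the printed plan and the timings are meant to expose.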


Re: Check if dataframe is empty

Posted by Deepak Sharma <de...@gmail.com>.
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> df.take(1).isEmpty should work


My bad.
It returns an empty array:
 emptydf.take(1)
res0: Array[org.apache.spark.sql.Row] = Array()

and applying isEmpty to it returns a Boolean:
 emptydf.take(1).isEmpty
res2: Boolean = true
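
Combining this with the Option-wrapped helper from the original question, a minimal sketch might look like the following. It assumes Spark 2.x with a local SparkSession; the object name EmptyCheck, the helper name isEmptyOpt and the sample data are made up for illustration and are not from the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyCheck {
  // Hypothetical helper: None counts as empty; otherwise fetch at most one row.
  // Option.forall returns true for None, so no separate isDefined check is needed.
  def isEmptyOpt(df: Option[DataFrame]): Boolean =
    df.forall(_.take(1).isEmpty)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-check-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: one DataFrame with rows, one filtered down to nothing.
    val nonEmpty: DataFrame = Seq(1, 2, 3).toDF("n")
    val empty: DataFrame = nonEmpty.filter($"n" > 10)

    println(isEmptyOpt(None))           // true
    println(isEmptyOpt(Some(empty)))    // true
    println(isEmptyOpt(Some(nonEmpty))) // false

    spark.stop()
  }
}
```

On Spark 2.4 and later, Dataset also provides its own isEmpty method, which may be worth trying in place of take(1).isEmpty.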




-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

Re: Check if dataframe is empty

Posted by Nick Pentreath <ni...@gmail.com>.
I believe take on an empty dataset will return an empty Array rather than
throw an exception.

df.take(1).isEmpty should work

On Tue, 7 Mar 2017 at 07:42, Deepak Sharma <de...@gmail.com> wrote:

> If the df is empty, .take would throw
> java.util.NoSuchElementException.
> This can be done as below:
> df.rdd.isEmpty

Re: Check if dataframe is empty

Posted by Deepak Sharma <de...@gmail.com>.
If the df is empty, .take would throw
java.util.NoSuchElementException.
The check can be done as below:
df.rdd.isEmpty


On Tue, Mar 7, 2017 at 9:33 AM, <ja...@accenture.com> wrote:

> Dataframe.take(1) is faster.




RE: Check if dataframe is empty

Posted by ja...@accenture.com.
Dataframe.take(1) is faster.


________________________________

This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.
______________________________________________________________________________________

www.accenture.com