Posted to dev@spark.apache.org by zhangliyun <ke...@126.com> on 2019/12/02 05:05:52 UTC

A question about RDD bytes size

Hi:


 I want to get the total bytes of a DataFrame using the following function, but when I insert the DataFrame into Hive, the value returned by the function differs from spark.sql.statistics.totalSize: spark.sql.statistics.totalSize is less than the result of the getRDDBytes function below.


   def getRDDBytes(df: DataFrame): Long = {
     df.rdd.getNumPartitions match {
       case 0 =>
         0L
       case _ =>
         // Sum the UTF-8 byte length of each row's string representation.
         val rddOfDataframe = df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
         if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
     }
   }
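
For reference, each element of the mapped RDD is the UTF-8 byte length of one row's string rendering. A minimal illustration of what that measures (spark-shell; values are illustrative):

   import org.apache.spark.sql.Row

   // Row.toString renders a row as "[value1,value2,...]", so the sum in
   // getRDDBytes is the size of a plain-text rendering of the data, not
   // the size of any on-disk or in-memory storage format.
   val row = Row(1, "a")
   println(row.toString)                          // prints [1,a]
   println(row.toString.getBytes("UTF-8").length) // prints 5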
I'd appreciate any suggestions.


Best Regards
Kelly Zhang


Re: A question about RDD bytes size

Posted by Wenchen Fan <cl...@gmail.com>.
When we talk about size in bytes, we need to specify how the data is stored.
For example, if we cache the DataFrame, then the size is the number of bytes
of the binary format of the table cache. If we write to Hive tables, then the
size is the total size of the table's data files.
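
A minimal sketch of how the two sizes can be inspected (assuming a Hive-enabled SparkSession and an existing Hive table named t; the table name is illustrative):

   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
   import spark.implicits._

   val df = spark.table("t")

   // Bytes of the in-memory binary format once the DataFrame is cached:
   df.cache().count()  // materialize the cache
   println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

   // Bytes of the table's data files, as recorded in the metastore
   // (this is what spark.sql.statistics.totalSize reflects):
   spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")
   spark.sql("DESCRIBE TABLE EXTENDED t")
     .filter($"col_name" === "Statistics")
     .show(truncate = false)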

On Mon, Dec 2, 2019 at 1:06 PM zhangliyun <ke...@126.com> wrote:

> Hi:
>
>  I want to get the total bytes of a DataFrame using the following
> function, but when I insert the DataFrame into Hive, the value returned by
> the function differs from spark.sql.statistics.totalSize:
> spark.sql.statistics.totalSize is less than the result of the getRDDBytes
> function below.
>
>    def getRDDBytes(df: DataFrame): Long = {
>      df.rdd.getNumPartitions match {
>        case 0 =>
>          0L
>        case _ =>
>          // Sum the UTF-8 byte length of each row's string representation.
>          val rddOfDataframe = df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
>          if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
>      }
>    }
> I'd appreciate any suggestions.
>
> Best Regards
> Kelly Zhang
