Posted to user@spark.apache.org by "second_comet@yahoo.com.INVALID" <se...@yahoo.com.INVALID> on 2023/01/13 03:54:40 UTC
pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame
Good day,
May I know what the difference is between pyspark.sql.dataframe.DataFrame and pyspark.pandas.frame.DataFrame? Are both stored in the Spark DataFrame format?
I'm looking for a way to load a huge Excel file (4-10 GB). Should I use the third-party library spark-excel, or just the native pyspark.pandas? I prefer a Spark DataFrame so that the work is parallelized across the Spark executors instead of running on the driver.
Can you advise?
Detail
---
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load("/path/big_excel.xls")
print(type(df))  # output: pyspark.sql.dataframe.DataFrame

import pyspark.pandas as ps

path = "/path/big-excel.xls"
df = ps.read_excel(path)
print(type(df))  # output: pyspark.pandas.frame.DataFrame
Thank you.
Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame
Posted by Sean Owen <sr...@gmail.com>.
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper
on a Pyspark DataFrame. They're the same thing with different APIs.
Neither has a 'storage format'.
spark-excel might be fine, and it works with Spark DataFrames. Because
pyspark.pandas emulates the pandas API, it also has a read_excel
function that could work.
You can try both and see which works for you.