Posted to user@spark.apache.org by "second_comet@yahoo.com.INVALID" <se...@yahoo.com.INVALID> on 2023/01/13 03:54:40 UTC
pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame
Good day,
May I know what the difference is between pyspark.sql.dataframe.DataFrame and pyspark.pandas.frame.DataFrame? Are both stored in the Spark DataFrame format?
I'm looking for a way to load a huge Excel file (4-10 GB). Should I use the third-party library spark-excel, or just the native pyspark.pandas? I prefer a Spark DataFrame so that the work is parallelized across the Spark executors instead of running on the driver.
Can you advise?
Detail
---
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load("/path/big_excel.xls")
print(type(df))  # output: pyspark.sql.dataframe.DataFrame

import pyspark.pandas as ps

path = "/path/big-excel.xls"
df = ps.read_excel(path)
print(type(df))  # output: pyspark.pandas.frame.DataFrame
Thank you.
Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame
Posted by Sean Owen <sr...@gmail.com>.
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper
on a Pyspark DataFrame. They're the same thing with different APIs.
Neither has a 'storage format'.
spark-excel might be fine, and it works with Spark DataFrames. Because
pyspark.pandas emulates the pandas API, it also has a read_excel
function that could work.
You can try both and see which works for you.