Posted to user@spark.apache.org by Yogesh Vyas <in...@gmail.com> on 2017/04/10 05:19:27 UTC

pandas DF Dstream to Spark DF

Hi,

I am writing a PySpark streaming job in which I am returning a pandas
data frame as a DStream. Now I want to save this DStream of data frames
to a parquet file. How can I do that?

I tried converting it to a Spark data frame, but I am getting multiple
errors. Please suggest how to do that.

Regards,
Yogesh

Re: pandas DF Dstream to Spark DF

Posted by Bryan Cutler <cu...@gmail.com>.
Hi Yogesh,

It would be easier to help if you included your code and the exact error
messages that occur.  If you create a Spark DataFrame from a Pandas
DataFrame, Spark does not read the Pandas schema; it infers one from the
data.  This might be the cause of your issue if the schema is not
inferred correctly.  You can try specifying the schema manually, like this
for example

from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, DoubleType)
import pandas

schema = StructType([
            StructField("str_t", StringType(), True),
            StructField("int_t", IntegerType(), True),
            StructField("double_t", DoubleType(), True)])

pandas_df = pandas.DataFrame(data={...})
# 'spark' is an existing SparkSession
spark_df = spark.createDataFrame(pandas_df, schema=schema)

This step might be eliminated by using Apache Arrow, see SPARK-13534 for
related work.
