Posted to user@spark.apache.org by Davies Liu <da...@databricks.com> on 2015/07/02 22:21:30 UTC
Re: is there any significant performance issue converting between rdd
and dataframes in pyspark?
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl <ax...@whisperstream.com> wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd is
> being materialized/collected/repartitioned before it's converted to a
> dataframe.
That's not the case. When converting an RDD to a DataFrame, it only takes a few
rows to infer the types; no other collect/repartition happens.
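For illustration, here is a minimal sketch of the two conversion paths. It
assumes Spark 2.0 or later, where SparkSession is the entry point (on the 1.x
releases the equivalent call is SQLContext.createDataFrame). The names
parse_record and build_dataframes and the sample data are made up for the
example. Passing an explicit schema is one way to avoid the inference step.

```python
def parse_record(line):
    """Turn 'name, age' text into a (str, int) tuple. Pure Python, no Spark needed."""
    name, age = line.split(",")
    return (name.strip(), int(age))


def build_dataframes():
    """Build the same DataFrame twice: once with inferred types, once with a schema."""
    # Imported here so parse_record can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (IntegerType, StringType, StructField,
                                   StructType)

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("rdd-to-df-sketch")
             .getOrCreate())

    # Hypothetical sample data standing in for a real RDD.
    lines = spark.sparkContext.parallelize(["alice, 34", "bob, 45"])
    records = lines.map(parse_record)

    # Option 1: only column names are given, so Spark looks at the first
    # row(s) of the RDD to infer the column types.
    inferred = spark.createDataFrame(records, ["name", "age"])

    # Option 2: an explicit schema is given, so no rows are read for inference.
    schema = StructType([
        StructField("name", StringType(), False),
        StructField("age", IntegerType(), False),
    ])
    explicit = spark.createDataFrame(records, schema)

    # The caller is responsible for calling spark.stop() when done.
    return spark, inferred, explicit
```

After calling build_dataframes(), printSchema() on each result shows the
difference: with inference a Python int becomes a long column, while the
explicit schema keeps it as integer.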
> Just wondering if there's any guidelines for doing this conversion and
> whether it's best to do it early to get the performance benefits of
> dataframes or weigh that against the size/number of items in the rdd.
It's better to do it as early as possible, I think.
> Thanks,
>
> -Axel
>