Posted to user@spark.apache.org by Davies Liu <da...@databricks.com> on 2015/07/02 22:21:30 UTC

Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl <ax...@whisperstream.com> wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd is
> being materialized/collected/repartitioned before it's converted to a
> dataframe.

That's not the case. When converting an RDD to a DataFrame, Spark only takes a
few rows to infer the types; no other collect or repartition happens.
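
To illustrate the idea of inferring types from only a few rows, here is a minimal pure-Python sketch. It is not Spark's actual implementation: `infer_schema`, `make_rows`, `sample_size`, and the `consumed` list are hypothetical names made up for this example, and a lazy generator stands in for the RDD.

```python
from itertools import islice


def infer_schema(rows, sample_size=3):
    """Infer {field name: type name} from only the first few rows."""
    schema = {}
    # islice stops pulling from the iterator after sample_size rows,
    # so the rest of the data is never touched.
    for row in islice(rows, sample_size):
        for name, value in row.items():
            if value is not None:
                schema.setdefault(name, type(value).__name__)
    return schema


consumed = []  # records which rows were actually produced


def make_rows(n):
    """Lazy row source standing in for an RDD (an assumption for this sketch)."""
    for i in range(n):
        consumed.append(i)
        yield {"id": i, "name": "user%d" % i}


schema = infer_schema(make_rows(1000000))
print(schema)
print(len(consumed))
```

Only the sampled rows are ever generated here; the remaining rows stay unevaluated, which is the behaviour described above for the conversion.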

> Just wondering if there's any guidelines for doing this conversion and
> whether it's best to do it early to get the performance benefits of
> dataframes or weigh that against the size/number of items in the rdd.

It's better to do it as early as possible, I think.

> Thanks,
>
> -Axel
>
