Posted to dev@spark.apache.org by Philip <br...@gmail.com> on 2015/09/01 01:25:19 UTC

Re: IOError on createDataFrame

Pandas performance is definitely the issue here. You're using Pandas as an
ETL system, and it's better suited as an endpoint than a conduit.
That is, it's great to dump your data there and do your analysis within
Pandas, subject to its constraints, but if you need to "back out" and use
something that can spread out into multiple machines' memory space, you'll
need to go back to the original data sources.

Here is some performance information from the Odo project, a Python ETL
tool that supports both Pandas and Spark. They report that it takes on the
order of minutes to get 1M rows out of Pandas, and more than 11.5 hours to
push their 33 GB test set through it.
http://odo.readthedocs.org/en/latest/perf.html

Can you format your data as CSV or JSON or something that allows use of
faster loading tools?
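As a rough sketch of what I mean (the file path and column layout here are
illustrative, not from your data): dump the frame to CSV once on the driver,
then let Spark's textFile plus a per-line parser do the loading in parallel,
instead of converting every record to a Python list up front.

```python
# Hedged sketch: write the data out as CSV, then parse lines the way a
# Spark job would (sc.textFile(path), drop the header, map a parser over
# each line across executors). Uses only the stdlib so it runs anywhere.
import csv
import os
import tempfile

rows = [(i, i * 2.0) for i in range(5)]  # stand-in for the real DataFrame
path = os.path.join(tempfile.mkdtemp(), "data.csv")

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])  # header row
    writer.writerows(rows)

def parse(line):
    # A Spark job would apply this per-line parser in parallel.
    i, v = line.split(",")
    return int(i), float(v)

with open(path) as f:
    next(f)  # skip the header line
    parsed = [parse(line.strip()) for line in f]

print(parsed[0])  # -> (0, 0.0)
```

The point is that the expensive per-row Python conversion moves off the
driver and into the distributed read, where it can be spread across machines.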

Philip

On Mon, Aug 31, 2015 at 7:50 AM, fsacerdoti <fs...@jumptrading.com>
wrote:

> There are two issues here:
>
> 1. Suppression of the true reason for failure. The spark runtime reports
> "TypeError" but that is not why the operation failed.
>
> 2. The low performance of loading a pandas dataframe.
>
>
> DISCUSSION
>
> Number (1) is easily fixed, and the primary purpose for my post.
> Number (2) is harder, and may lead us to abandon Spark. To answer Akhil,
> the
> process is too slow. Yes it will work, but with large dense datasets, the
> line
>
>     data = [r.tolist() for r in data.to_records(index=False)]
>
> is basically a brick wall. It will take longer to load the RDD than to do
> all operations on it, by a large margin.
>
> Any help or guidance (should we write some custom loader?) would be
> appreciated.
>
> FDS
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888p13912.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>