Posted to user@spark.apache.org by Charlie Hack <ch...@gmail.com> on 2015/08/20 22:08:16 UTC

Creating Spark DataFrame from large pandas DataFrame

Hi,

I'm new to Spark and am trying to create a Spark DataFrame from a pandas
DataFrame with ~5 million rows, using Spark 1.4.1.

When I type:

df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the .where call is a hack I found on the Spark JIRA to avoid NaN values
producing mixed column types)

I get:

TypeError: cannot create an RDD from type: <type 'list'>

Converting a smaller pandas DataFrame (~2000 rows) works fine. Has anyone
hit this issue?
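
One workaround I may try: convert the frame in chunks and union the
pieces, so no single createDataFrame call sees all 5M rows. A rough
sketch (the chunk size is arbitrary, it assumes every chunk has enough
non-null values for schema inference, and I haven't confirmed it dodges
the 1.4.1 error; pandas_df and sqlContext are as above):

import numpy as np
import pandas as pd

def pandas_to_spark(sql_ctx, pdf, chunk_size=500000):
    # Split into roughly chunk_size-row pieces and union the conversions.
    chunks = np.array_split(pdf, max(1, len(pdf) // chunk_size))
    sdf = sql_ctx.createDataFrame(chunks[0])
    for chunk in chunks[1:]:
        sdf = sdf.unionAll(sql_ctx.createDataFrame(chunk))  # union() is 2.0+
    return sdf

df = pandas_to_spark(sqlContext, pandas_df.where(pd.notnull(pandas_df), None))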


This is already a workaround; ideally I'd read the Spark DataFrame from a
Hive table, but that's not currently an option for my setup.

I also tried reading the data into Spark from a CSV using spark-csv, but
haven't been able to make that work yet. I launch

$ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the CSV I get:

Py4JJavaError: An error occurred while calling o22.load. :
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat ...
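
The read call follows the usual spark-csv pattern, roughly (the path and
options here are placeholders):

df = (sqlContext.read
        .format('com.databricks.spark.csv')
        .options(header='true', inferSchema='true')
        .load('path/to/file.csv'))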

Other options I can think of:

- Convert my CSV to JSON (using Pig?) and read that into Spark
- Read the data in over a JDBC connection from Postgres (see the sketch below)
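
For the JDBC route, a rough sketch (the URL, table, and driver here are
placeholders, and the Postgres JDBC driver jar would need to be on the
classpath):

df = (sqlContext.read
        .format('jdbc')
        .options(
            url='jdbc:postgresql://host:5432/dbname',
            dbtable='schema.tablename',
            driver='org.postgresql.Driver')
        .load())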

But I want to make sure I'm not misusing Spark or missing something obvious.

Thanks!

Charlie

Re: Creating Spark DataFrame from large pandas DataFrame

Posted by ayan guha <gu...@gmail.com>.
The easiest option I found is to put the jars on the SPARK_CLASSPATH.
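
For example (paths are placeholders; commons-csv is the dependency the
NoClassDefFoundError points at, and the 1.1 version here is my guess from
spark-csv's POM):

$ SPARK_CLASSPATH="/path/to/spark-csv_2.11-1.2.0.jar:/path/to/commons-csv-1.1.jar" pyspark
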
On 21 Aug 2015 06:20, "Burak Yavuz" <br...@gmail.com> wrote:

> If you would like to try using spark-csv, please use
> `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
>
> You're missing a dependency.
>
> Best,
> Burak
>
> On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack <ch...@gmail.com>
> wrote:
>
>> [quoted text snipped]

Re: Creating Spark DataFrame from large pandas DataFrame

Posted by Burak Yavuz <br...@gmail.com>.
If you would like to try using spark-csv, please use
`pyspark --packages com.databricks:spark-csv_2.11:1.2.0`

You're missing a dependency.
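
If you'd rather stay with `--jars`, you would need to pass the transitive
dependency yourself as well, something like (the commons-csv version is an
assumption; check spark-csv's POM):

$ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar,path/to/commons-csv-1.1.jar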

Best,
Burak

On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack <ch...@gmail.com>
wrote:

> [quoted text snipped]