Posted to user@spark.apache.org by Devesh Raj Singh <ra...@gmail.com> on 2016/02/18 11:05:28 UTC
Reading CSV file using pyspark
Hi,
I want to read a CSV file in PySpark.
I am running PySpark in PyCharm and am trying to load a CSV like this:
import os
import sys

os.environ['SPARK_HOME'] = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6"
sys.path.append("/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6/python/")

# Now we are ready to import Spark modules
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.mllib.fpm import FPGrowth
    print("Successfully imported all Spark modules")
except ImportError as e:
    print("Error importing Spark modules", e)
    sys.exit(1)

sc = SparkContext('local')

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # must be created explicitly outside the pyspark shell

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/Users/devesh/work/iris/iris.csv')
I am getting the following error
Py4JJavaError: An error occurred while calling o88.load.
: java.lang.ClassNotFoundException: Failed to load class for data source:
com.databricks.spark.csv.
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
--
Warm regards,
Devesh.
Re: Reading CSV file using pyspark
Posted by Gourav Sengupta <go...@gmail.com>.
Hi Devesh,
you have to start your Spark shell with the spark-csv package on the
classpath. The command is below (you can use pyspark instead of
spark-shell); all the required commands are documented at
https://github.com/databricks/spark-csv. I prefer the 2.11 build over
2.10, as it resolves some write issues.
Hopefully you are using the latest Spark release.
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
Regards,
Gourav Sengupta
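Since the original script launches Spark from inside PyCharm rather than from a shell, the same --packages flag can be passed through the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. A minimal sketch (the version string mirrors Gourav's command; adjust it to your setup):

```python
import os

# Must be set before pyspark creates the SparkContext. The trailing
# "pyspark-shell" tells Spark's Python bootstrap which app to launch.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
```

With this in place, the rest of the script (SparkContext, SQLContext, and the com.databricks.spark.csv read) can stay as written.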
On Thu, Feb 18, 2016 at 11:05 AM, Teng Qiu <te...@gmail.com> wrote:
> download a right version of this jar
> http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 (or
> 2.11), and append it to SPARK_CLASSPATH
Re: Reading CSV file using pyspark
Posted by Teng Qiu <te...@gmail.com>.
download the right version of this jar from
http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 (or 2.11,
matching your Spark build's Scala version), and append it to SPARK_CLASSPATH
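Done from inside the PyCharm script, appending the downloaded jar to SPARK_CLASSPATH before the SparkContext starts might look roughly like this (the jar path is hypothetical; note that spark-csv 1.x also depends on commons-csv, which may need to be on the classpath as well):

```python
import os

# Hypothetical location of the jar downloaded from mvnrepository.com.
jar = "/Users/devesh/jars/spark-csv_2.11-1.3.0.jar"

# Append to SPARK_CLASSPATH, preserving anything already set there.
existing = os.environ.get("SPARK_CLASSPATH", "")
os.environ["SPARK_CLASSPATH"] = jar if not existing else existing + ":" + jar
```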