Posted to issues@spark.apache.org by "Ram Sriharsha (JIRA)" <ji...@apache.org> on 2015/05/05 17:45:00 UTC

[jira] [Comment Edited] (SPARK-5866) pyspark read from s3

    [ https://issues.apache.org/jira/browse/SPARK-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528686#comment-14528686 ] 

Ram Sriharsha edited comment on SPARK-5866 at 5/5/15 3:44 PM:
--------------------------------------------------------------

I'm not sure how the Scala version is working. The exception suggests it's looking for a path with scheme s3, when it should be s3n (the NativeS3FileSystem scheme is s3n).


was (Author: rams):
I'm not sure how the Scala version is working. The exception suggests it's looking for a path with protocol s3, when it should be s3n:// (the NativeS3FileSystem scheme is s3n).
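
For illustration, a minimal sketch of the suggested fix. The stack trace below goes through WholeTextFileRDD, so the read was presumably sc.wholeTextFiles; the bucket and path here are placeholders copied from the reported error:

    from pyspark import SparkContext

    sc = SparkContext()
    # Credentials keyed to the s3n filesystem (Hadoop's NativeS3FileSystem)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "SECRET_KEY")
    # The path scheme must match the credential properties: s3n://, not s3://
    rdd = sc.wholeTextFiles("s3n://bucketName/pathS3/1111_1417479684")
    print(rdd.first())

With an s3:// path, Hadoop resolves a different filesystem than the one the fs.s3n.* credentials configure, which is consistent with the "Input path does not exist" error below.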

> pyspark read from s3
> --------------------
>
>                 Key: SPARK-5866
>                 URL: https://issues.apache.org/jira/browse/SPARK-5866
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>         Environment: mac OSx and ec2 ubuntu
>            Reporter: venu k tangirala
>
> I am trying to read data from S3 via PySpark. I provided the credentials with:
> sc= SparkContext()
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "key")
> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "secret_key")
> I also tried setting the credentials in core-site.xml, placed in the conf/ directory.
> Interestingly, the same works with the Scala version of Spark, both by setting the S3 access key and secret key in Scala code and by setting them in core-site.xml.
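> For reference, a core-site.xml sketch of those same credentials (values are placeholders; the property names match the ones set programmatically above):
>
>     <configuration>
>       <property>
>         <name>fs.s3n.awsAccessKeyId</name>
>         <value>ACCESS_KEY</value>
>       </property>
>       <property>
>         <name>fs.s3n.awsSecretAccessKey</name>
>         <value>SECRET_KEY</value>
>       </property>
>     </configuration>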
> The PySpark error is as follows:
> File "/Users/myname/path/./spark_json.py", line 55, in <module>
>     vals_table = sqlContext.inferSchema(values)
>   File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
>     first = rdd.first()
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
>     rs = self.take(1)
>   File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
>     totalParts = self._jrdd.partitions().size()
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
>     self.target_id, self.name)
>   File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
>     format(target_id, '.', name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
> : org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/1111_1417479684
> 	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
> 	at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
> 	at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> 	at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
> 	at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
> 	at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> 	at py4j.Gateway.invoke(Gateway.java:259)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:207)
> 	at java.lang.Thread.run(Thread.java:724)


