Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2017/08/04 22:24:02 UTC

[jira] [Assigned] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

     [ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-21374:
-------------------------------

    Assignee: Andrey Taptunov

> Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-21374
>                 URL: https://issues.apache.org/jira/browse/SPARK-21374
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Andrey Taptunov
>            Assignee: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable the filesystem cache so that I can change the S3 access key and secret key on the fly and read from buckets with different permissions. This works perfectly fine for RDDs but doesn't work for DataFrames.
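> For clarity, a minimal sketch of the credential-switching pattern described above (the bucket names and the keyA/keyB values are placeholders, not real credentials); with the cache disabled, each read should pick up whatever keys are currently set in the Hadoop configuration:
> {code:java}
> // placeholders -- substitute real credentials
> val (keyA, secretA) = ("accessKeyA", "secretKeyA")
> val (keyB, secretB) = ("accessKeyB", "secretKeyB")
> sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
> // read from the first bucket with its own keys
> sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", keyA)
> sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", secretA)
> val countA = sc.textFile("s3://bucket-a/file.csv").count
> // switch keys on the fly and read from a second bucket
> sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", keyB)
> sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", secretB)
> val countB = sc.textFile("s3://bucket-b/file.csv").count
> {code}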
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> object SimpleApp {
>   def main(args: Array[String]): Unit = {
>     val awsAccessKeyId = "something"
>     val awsSecretKey = "something else"
>     val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
>     sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
>     sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
>     sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>     sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")
>     val spark = SparkSession.builder().config(conf).getOrCreate()
>     val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
>     val rddGlob = sc.textFile("s3://bucket/*").count // ok
>     val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
>     
>     val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
>     // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively)
>     // of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
>
>     sc.stop()
>   }
> }
> {code}
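> The error message above suggests a possible (untested) workaround while the cache is disabled: embedding URL-encoded credentials directly in the s3 URL. A sketch, using the placeholder keys from the example; the URL-encoding matters because secret keys may contain characters such as "/":
> {code:java}
> import java.net.URLEncoder
> def enc(s: String) = URLEncoder.encode(s, "UTF-8")
> // same glob read as above, but with credentials carried in the URL itself
> val dfGlob2 = spark.read.format("csv")
>   .load(s"s3://${enc(awsAccessKeyId)}:${enc(awsSecretKey)}@bucket/*")
>   .count
> {code}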



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org