Posted to issues@spark.apache.org by "Andrey Taptunov (JIRA)" <ji...@apache.org> on 2017/07/11 15:25:00 UTC
[jira] [Created] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
Andrey Taptunov created SPARK-21374:
---------------------------------------
Summary: Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
Key: SPARK-21374
URL: https://issues.apache.org/jira/browse/SPARK-21374
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.1, 2.0.2
Reporter: Andrey Taptunov
*Motivation:*
The filesystem configuration is not part of the cache key used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.
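To illustrate why changed credentials are silently ignored when the cache is on, here is a simplified model of the caching behavior (plain Scala; {{CacheKey}}, {{FakeFileSystem}} and {{FsCache}} are invented names for illustration, not Hadoop's real classes): the cache key is built from the URI's scheme and authority only, so the configuration passed on a second lookup never reaches the cached instance.
{code:java}
import scala.collection.mutable

// Simplified model of Hadoop's FileSystem cache: the key is derived from the
// URI's scheme and authority only -- the configuration is NOT part of the key.
case class CacheKey(scheme: String, authority: String)

class FakeFileSystem(val conf: Map[String, String])

object FsCache {
  private val cache = mutable.Map.empty[CacheKey, FakeFileSystem]

  // Returns the cached instance if one exists for (scheme, authority);
  // the conf argument is only consulted when a new instance is created.
  def get(scheme: String, authority: String, conf: Map[String, String]): FakeFileSystem =
    cache.getOrElseUpdate(CacheKey(scheme, authority), new FakeFileSystem(conf))
}

object Demo extends App {
  val fs1 = FsCache.get("s3", "bucket", Map("fs.s3.awsAccessKeyId" -> "keyA"))
  val fs2 = FsCache.get("s3", "bucket", Map("fs.s3.awsAccessKeyId" -> "keyB"))
  // Same cached instance: the second set of credentials is silently ignored.
  println(fs1 eq fs2)                       // true
  println(fs1.conf("fs.s3.awsAccessKeyId")) // keyA
}
{code}
This is why disabling the cache is the only way to switch credentials for the same bucket within one JVM.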
*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count // IllegalArgumentException

    sc.stop()
  }
}
{code}
*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create the FileSystem instance. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache, so that "conf" object is effectively ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings the user has set.
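The analysis above can be sketched as a simplified model (plain Scala; {{Fs}} and {{GlobModel}} are invented names, not Spark's real classes): with caching on, the instance created earlier from the user's configuration is reused, so the glob lookup works by accident; with caching off, the FileSystem is rebuilt from an internal configuration that never received the credentials, which is what produces the IllegalArgumentException.
{code:java}
// Simplified model of the FileSystem holding a flat key/value configuration.
case class Fs(conf: Map[String, String]) {
  def open(): String =
    conf.getOrElse("fs.s3.awsAccessKeyId",
      throw new IllegalArgumentException("AWS Access Key ID must be specified"))
}

object GlobModel {
  val userConf = Map("fs.s3.awsAccessKeyId" -> "something") // sc.hadoopConfiguration
  val internalConf = Map.empty[String, String]              // conf seen by globPath

  private var cached: Option[Fs] = None

  def getFs(conf: Map[String, String], cacheEnabled: Boolean): Fs =
    if (cacheEnabled) {
      // With caching, the conf argument is ignored once an instance exists;
      // the cached instance was created earlier from the user's configuration.
      cached.getOrElse { val fs = Fs(userConf); cached = Some(fs); fs }
    } else {
      Fs(conf) // Without caching, whatever conf is passed in is used as-is.
    }
}

object Demo extends App {
  // Cache on: the user's credentials are found via the cached instance.
  println(GlobModel.getFs(GlobModel.internalConf, cacheEnabled = true).open())
  // Cache off: the internal conf lacks the key -> IllegalArgumentException.
  try GlobModel.getFs(GlobModel.internalConf, cacheEnabled = false).open()
  catch { case e: IllegalArgumentException => println(e.getMessage) }
}
{code}
As a possible workaround until this is fixed: Spark copies SparkConf keys prefixed with "spark.hadoop." into the Hadoop Configurations it constructs, so setting the credentials that way before the context is created may make them visible to the configuration globPath uses as well.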
I would be happy to help with a pull request if you agree that this is a bug.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org