Posted to issues@spark.apache.org by "Matt Cheah (JIRA)" <ji...@apache.org> on 2016/03/15 18:45:34 UTC

[jira] [Created] (SPARK-13912) spark.hadoop.* configurations are not applied for Parquet Data Frame Readers

Matt Cheah created SPARK-13912:
----------------------------------

             Summary: spark.hadoop.* configurations are not applied for Parquet Data Frame Readers
                 Key: SPARK-13912
                 URL: https://issues.apache.org/jira/browse/SPARK-13912
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Matt Cheah


I populated the SparkConf object passed to my SparkContext with some spark.hadoop.* configurations, expecting them to be applied to the underlying Hadoop reads whenever I read from my DFS. However, when running some jobs I noticed that those configurations were not being applied to Data Frame reads done via sqlContext.read().parquet().
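
Roughly the setup I am describing (a minimal sketch against the 1.6 APIs; the property name fs.example.setting, the master, and the path are made up):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("parquet-hadoop-conf-repro")
  .setMaster("local[2]")
  // spark.hadoop.* keys are supposed to be forwarded into the Hadoop Configuration
  .set("spark.hadoop.fs.example.setting", "expected-value")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// The setting shows up on the Spark Context's Hadoop configuration...
println(sc.hadoopConfiguration.get("fs.example.setting"))   // expected-value

// ...but it does not appear to be applied to the Parquet read below.
val df = sqlContext.read.parquet("hdfs:///path/to/table")
{code}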

I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop reads; instead, it uses SparkHadoopUtil.get.conf. That Hadoop configuration object won't contain the Hadoop configurations that were set on the Spark Context. In general there seems to be a discrepancy in how Hadoop configurations are resolved: when reading raw RDDs via e.g. SparkContext.textFile() we take the Hadoop configuration from the Spark Context, but for Data Frames we use SparkHadoopUtil.get.conf.
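
To illustrate the discrepancy (a hedged sketch; SparkHadoopUtil is an internal class in org.apache.spark.deploy, the key is made up, and the null is the behaviour being reported, not something the API promises):

{code:scala}
import org.apache.spark.deploy.SparkHadoopUtil

// Raw RDD reads such as sc.textFile() go through the Spark Context's Hadoop
// configuration, which does contain the spark.hadoop.* settings:
println(sc.hadoopConfiguration.get("fs.example.setting"))    // expected-value

// SqlNewHadoopRDD instead reads SparkHadoopUtil.get.conf, which is built from a
// separate SparkConf and never sees values set programmatically on the SparkConf
// passed to the SparkContext:
println(SparkHadoopUtil.get.conf.get("fs.example.setting"))  // null
{code}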

We should probably use the Spark Context's Hadoop configuration for Data Frames as well.
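
In the meantime, a possible workaround (not verified; it relies on SparkHadoopUtil building its configuration from a SparkConf loaded from JVM system properties) is to set the spark.hadoop.* key as a system property before anything in Spark is initialized:

{code:scala}
// Hedged workaround sketch: put the key where SparkHadoopUtil's own SparkConf
// will pick it up (system properties), before creating the SparkConf/SparkContext.
// The property name and value are made up.
System.setProperty("spark.hadoop.fs.example.setting", "expected-value")
{code}

Passing the same key via spark-defaults.conf or spark-submit --conf should have a similar effect, since those also end up in the driver's system properties.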



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org