Posted to issues@spark.apache.org by "Matt Cheah (JIRA)" <ji...@apache.org> on 2016/03/15 18:55:33 UTC

[jira] [Comment Edited] (SPARK-13912) spark.hadoop.* configurations are not applied for Parquet Data Frame Readers

    [ https://issues.apache.org/jira/browse/SPARK-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195797#comment-15195797 ] 

Matt Cheah edited comment on SPARK-13912 at 3/15/16 5:54 PM:
-------------------------------------------------------------

It's not exactly the same issue, if I'm reading the PR for SPARK-13403 right. The other ticket and its PR appear to cover the Hive-specific code path, but this code path is in DataSourceStrategy.

Edit: More precisely, I don't think we'll get the desired behavior when we aren't using HiveContext / HiveConf, even with the fix to SPARK-13403.
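
For context, a rough sketch of the two code paths being contrasted, with hypothetical table and path names (sc is assumed to be an existing SparkContext; Spark 1.6 APIs):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.hive.HiveContext

    // Hive path: HiveContext resolves metastore tables through HiveConf,
    // which appears to be what the SPARK-13403 PR covers.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT * FROM events")         // hypothetical table

    // Non-Hive path: a plain SQLContext plans Parquet file reads through
    // DataSourceStrategy, so that fix would not help here.
    val sqlContext = new SQLContext(sc)
    sqlContext.read.parquet("hdfs:///data/events")  // hypothetical path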


was (Author: mcheah):
It's not exactly the same issue, if I'm reading the PR for SPARK-13403 right. The other ticket and its PR appear to cover the Hive-specific code path, but this code path is in DataSourceStrategy.

> spark.hadoop.* configurations are not applied for Parquet Data Frame Readers
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-13912
>                 URL: https://issues.apache.org/jira/browse/SPARK-13912
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Matt Cheah
>
> I populated the SparkConf passed to a SparkContext with some spark.hadoop.* configurations, expecting them to be applied to the underlying Hadoop file reads whenever I read from my DFS. However, when running some jobs I noticed that the configurations were not applied to data frame reads through sqlContext.read().parquet().
> I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop reads; instead, it uses SparkHadoopUtil.get.conf. That Hadoop configuration object won't include the Hadoop settings applied to the SparkContext. In general we seem to have a discrepancy in how Hadoop configurations are resolved: when reading raw RDDs via e.g. SparkContext.textFile() we take the Hadoop configuration from the SparkContext, but for Data Frames we use SparkHadoopUtil.get.conf.
> We should probably use the SparkContext's Hadoop configuration for Data Frames as well. A minimal repro sketch follows this description.
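
A minimal repro sketch of the scenario described above (Spark 1.6 APIs; the fs.s3a.access.key setting and the paths are hypothetical placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("spark-hadoop-conf-repro")
      // The spark.hadoop. prefix should be stripped and the remainder copied
      // into the Hadoop Configuration used for file reads.
      .set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")   // placeholder
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Raw RDD reads take sc.hadoopConfiguration, so they see the setting.
    sc.textFile("s3a://bucket/raw.txt").count()

    // The Parquet data frame read goes through SqlNewHadoopRDD, which uses
    // SparkHadoopUtil.get.conf and misses the setting in 1.6.1.
    sqlContext.read.parquet("s3a://bucket/table").count()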



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org