Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/06/28 12:47:05 UTC
[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API
[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-8690:
-----------------------------------
Assignee: (was: Apache Spark)
> Add a setting to disable SparkSQL parquet schema merge by using datasource API
> -------------------------------------------------------------------------------
>
> Key: SPARK-8690
> URL: https://issues.apache.org/jira/browse/SPARK-8690
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.0
> Environment: all
> Reporter: thegiive
> Priority: Minor
>
> We need a general config to disable the parquet schema merge feature.
> Our SparkSQL application has the following requirements:
> # In Spark 1.1 and 1.2, SparkSQL took around 1~5 sec to read our parquet files, and we do not want that read time to increase much. We have around 2000 parquet files that all share the same schema, so we do not need the schema merge feature.
> # We need datasource API features such as partition discovery, so we cannot use Spark 1.2 or a previous version.
> # We have a lot of SparkSQL products that use *sqlContext.parquetFile(filename)* to read parquet files, and we do not want to change the application code. A single setting to disable this feature is what we want.
> In 1.4 we have several options, but none of them perfectly matches our use case:
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements 1 and 3, but it uses the old parquet API and fails requirement 2.
> # Use sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" )). This meets requirements 1 and 2, but it requires changing a lot of the code we use to load parquet.
> # Spark 1.4 improves schema merging a lot over 1.3, but using the default parquet behavior as-is increases the load time from 1~5 sec to 100 sec, which fails requirement 1.
> # Try the config from PR 5231, but it cannot disable schema merge.
> I think it is better to use a config to disable the datasource API's schema merge feature. A PR will be provided later.
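
Until such a config exists, the second option quoted above can at least be kept in one place. The following is only a minimal sketch, not the reporter's PR and not a Spark setting: the object name ParquetLoadOptions and the property name myapp.parquet.mergeSchema are hypothetical, made up for illustration. It builds the options map for the sqlContext.load("parquet", ...) call mentioned in the issue and defaults mergeSchema to "false", so the choice is made once instead of at every call site.

```scala
// Sketch only: builds the options map for the sqlContext.load call quoted in the issue.
object ParquetLoadOptions {
  // Hypothetical application-level property name; NOT a real Spark configuration key.
  val MergeSchemaProperty = "myapp.parquet.mergeSchema"

  // Returns the options for one parquet path. Schema merging stays off
  // unless the (hypothetical) property is explicitly set to another value.
  def apply(path: String,
            props: Map[String, String] = sys.props.toMap): Map[String, String] = {
    val merge = props.getOrElse(MergeSchemaProperty, "false")
    Map("path" -> path, "mergeSchema" -> merge)
  }
}

// Usage with Spark 1.4 (needs a live SQLContext, so shown as a comment):
//   val df = sqlContext.load("parquet", ParquetLoadOptions("/data/events"))
```

This still means touching each sqlContext.parquetFile(filename) call once, so it does not satisfy requirement 3. It only narrows the change to a single helper, which is why a real Spark-side config remains the preferable fix.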