Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/06/28 12:47:05 UTC

[jira] [Assigned] (SPARK-8690) Add a setting to disable SparkSQL parquet schema merge by using datasource API

     [ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8690:
-----------------------------------

    Assignee:     (was: Apache Spark)

> Add a setting to disable SparkSQL parquet schema merge by using datasource API 
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-8690
>                 URL: https://issues.apache.org/jira/browse/SPARK-8690
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: all
>            Reporter: thegiive
>            Priority: Minor
>
> We need a general config setting to disable the Parquet schema merge feature.
> Our Spark SQL application requirements are:
> # In Spark 1.1 and 1.2, reading Parquet with Spark SQL takes around 1~5 sec, and we don't want the read time to increase much. We have around 2000 Parquet files that all share the same schema, so we don't need the schema merge feature.
> # We need data source API features such as partition discovery, so we cannot use Spark 1.2 or a previous version.
> # We have a lot of Spark SQL products that use *sqlContext.parquetFile(filename)* to read Parquet files, and we don't want to change the application code. A single setting to disable this feature is what we want.
> In 1.4, we have several options, but none of them perfectly matches our use case:
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements 1 and 3, but it uses the old Parquet API and fails requirement 2.
> # Use sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" )). This meets requirements 1 and 2, but it requires changing a lot of our Parquet loading code.
> # Use the default Parquet behavior in 1.4. Spark 1.4 improves schema merging a lot over 1.3, but the default still increases the load time from 1~5 sec to 100 sec, which fails requirement 1.
> # Try the config from PR 5231, but it cannot disable schema merge.
> I think it is better to add a config that disables the data source API's schema merge feature. A PR will be provided later.
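
For reference, the first two 1.4 workarounds listed in the description can be sketched in Scala roughly as follows. This is only an illustration, assuming Spark 1.4 is on the classpath; the object name ParquetWorkarounds and the helpers readViaOldApi and readNoMerge are hypothetical names and are not part of Spark or of the reporter's code.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper object for illustration only; not part of Spark.
object ParquetWorkarounds {

  // Workaround 1 from the description: switch back to the old Parquet code
  // path. Call sites keep using parquetFile, but, per the description, data
  // source API features such as partition discovery are lost.
  def readViaOldApi(sqlContext: SQLContext, path: String): DataFrame = {
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    sqlContext.parquetFile(path)
  }

  // Workaround 2 from the description: stay on the data source API and turn
  // schema merging off for this one read. Every existing parquetFile call
  // site would have to be changed to go through something like this.
  def readNoMerge(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.load("parquet", Map("path" -> path, "mergeSchema" -> "false"))
}
```

Neither helper satisfies all three requirements at once: the first gives up the data source API, and the second needs a change at every call site. That gap is what the requested global setting is meant to close.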


