Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:44:13 UTC

[jira] [Resolved] (SPARK-24264) [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration

     [ https://issues.apache.org/jira/browse/SPARK-24264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24264.
----------------------------------
    Resolution: Incomplete

> [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-24264
>                 URL: https://issues.apache.org/jira/browse/SPARK-24264
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Gerard Maas
>            Priority: Major
>              Labels: bulk-closed, features, usability
>
> Looking into Parquet format support for the File source in Structured Streaming, the docs mention the 'mergeSchema' option for merging the schemas of the part files found. [1]
>  
> There seems to be no practical use of that configuration in a streaming context.
>  
> In its batch counterpart, `mergeSchema` infers the superset schema of the part files found.
>  
>  When using the File source with the Parquet format in streaming mode, we must provide a schema to the readStream.schema(...) builder, and that schema is fixed for the duration of the stream.
>  
> My current understanding is that:
>  
> - Files containing a subset of the fields declared in the schema will render null values for the non-existing fields.
> - For files containing a superset of the fields, the additional data fields will be lost. 
> - Files not matching the schema set on the streaming source will render all fields null for each record in the file.
>  
> It looks like 'mergeSchema' has no practical effect, although enabling it might trigger additional processing to actually merge the Parquet schemas of the input files.
>  
> I inquired on the dev and user mailing lists about any other behavior but got no responses.
>  
> From the user's perspective, this option may look like it helps a job cope with schema evolution at runtime, but that is not the case either.
>  
> Removing this option and leaving the value fixed to false looks like the reasonable thing to do.
>  
> [1] [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]
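The projection behavior described in the three bullets above can be sketched with a tiny, self-contained model (plain Python, not Spark; the record and field names here are invented for illustration): each input record is projected onto the fixed declared schema, so fields missing from a record come back as null/None and extra fields are silently dropped, regardless of any merge option.

```python
# Toy model of a fixed streaming schema: each input record is projected
# onto the declared fields. Fields missing from a record become None;
# fields not in the schema are dropped. Field names are hypothetical.
declared_schema = ["id", "name", "score"]

def project(record, schema):
    """Project a record (a dict) onto the declared schema."""
    return {field: record.get(field) for field in schema}

# Record with a subset of the schema's fields -> missing field renders as None.
print(project({"id": 1, "name": "a"}, declared_schema))

# Record with a superset of the fields -> the extra field is lost.
print(project({"id": 2, "name": "b", "score": 0.5, "extra": True}, declared_schema))
```

Since the declared schema alone decides the output shape, a merge step over the input files' schemas has nothing to contribute in this mode.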



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org