Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/07/13 07:25:20 UTC

[jira] [Commented] (SPARK-16518) Schema Compatibility of Parquet Data Source

    [ https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374518#comment-15374518 ] 

Hyukjin Kwon commented on SPARK-16518:
--------------------------------------

Cool, this is exactly what I have been wondering about and looking into. For case 1, it seems the schema is read first from the footer, and then the matching fields are resolved against the required fields (from the Spark schema). However, the fields are matched only by *name*; this happens in {{ParquetReadSupport.clipParquetGroupFields()}}. (It seems ORC is read the same way; see {{OrcFileFormat.setRequiredColumns()}}.)
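The name-based matching in {{clipParquetGroupFields()}} can be modeled with a small Python sketch (the {{clip_by_name}} helper, field names, and type strings are all hypothetical; this only illustrates the name-only lookup, not the actual implementation):

```python
# Toy model of matching requested (Spark) fields against a Parquet file
# schema purely by name. Field names and type strings are hypothetical.

def clip_by_name(file_schema, requested_fields):
    """Select the file's field for each requested name; types come from the file."""
    by_name = dict(file_schema)
    # Only the name is consulted -- the type recorded in the file wins,
    # even when it disagrees with the type Spark expects.
    return [(name, by_name[name]) for name, _ in requested_fields if name in by_name]

file_schema = [("col1", "int"), ("col2", "int")]     # types stored in the file
requested   = [("col1", "int"), ("col2", "string")]  # Spark's table schema

print(clip_by_name(file_schema, requested))
# -> [('col1', 'int'), ('col2', 'int')]: col2 keeps the file's int, and the
#    mismatch with the expected string only surfaces later, at read time.
```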

Then, in {{VectorizedParquetRecordReader.initialize()}} ({{SpecificParquetRecordReaderBase.initialize()}}), this is converted back into a Spark schema. After that, it starts to read the records correctly (according to the schema from the Parquet file).

So, on the Parquet side, the records are read correctly. But when Spark starts to load them, it calls {{ColumnarBatch.getInt}} via {{ColumnarBatch.Row.getInt}} on the columns. Since the value was never set, it throws a {{NullPointerException}}.

Put simply, if there is data like the below in Parquet,

{code}
+------+
|     a|
+------+
|   "b"|
+------+
{code}

This is read correctly from Parquet,

{code}
Row("b")
{code}

Then, let's say Spark tries to read an integer from that row.

{code}
Row("b").getInt(0)
{code}

This throws an exception.
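That failure can be modeled in plain Python as well (the {{IntColumn}} class below is a toy stand-in, not the actual {{OnHeapColumnVector}} code; it only illustrates reading an int slot that was never populated):

```python
class IntColumn:
    """Toy stand-in for an on-heap int column vector."""

    def __init__(self, size):
        # A slot stays unset when the file held a non-int value for it.
        self.values = [None] * size

    def put_int(self, row, value):
        self.values[row] = value

    def get_int(self, row):
        value = self.values[row]
        if value is None:
            # Analogue of the NullPointerException from OnHeapColumnVector.getInt:
            # the generated code asks for an int that was never written.
            raise LookupError("no int was ever set at row %d" % row)
        return value

col = IntColumn(1)  # the file stored the string "b", so no int was written
try:
    col.get_int(0)
except LookupError as e:
    print("read failed:", e)
```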

> Schema Compatibility of Parquet Data Source
> -------------------------------------------
>
>                 Key: SPARK-16518
>                 URL: https://issues.apache.org/jira/browse/SPARK-16518
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Currently, we are not checking the schema compatibility. Different file formats behave differently. This JIRA just summarizes what I observed for parquet data source tables.
> *Scenario 1 Data type mismatch*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
>  Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
> 	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
> {noformat}
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException:
>  Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
> root
>  |-- a: integer (nullable = false)
>  |-- b: string (nullable = true)
> {noformat}
> *Scenario 2 More columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int, col2 string, col3 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema of the resultset is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of the resultset is {{(col1 int, col2 string, col3 int)}}.
> *Scenario 3 Less columns in append dataset*:
> The existing schema is {{(col1 int, col2 string)}}
> The schema of appending dataset is {{(col1 int)}}
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema of the resultset is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of the resultset is {{(col1 int)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
