Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/10/03 15:00:11 UTC

[GitHub] [incubator-iceberg] andrei-ionescu opened a new issue #510: Cannot update an Iceberg dataset from a Parquet file due to "field should be required, but is optional"

URL: https://github.com/apache/incubator-iceberg/issues/510
 
 
   Given an Iceberg dataset found at `targetPath` with the following schema named `icebergSchema`:
   ```
   table {
     1: optionalField: optional string
     2: requiredField: required string
   }
   ```
   And given a Parquet file found at `sourcePath` with the following schema:
   ```
   message spark_schema {
     optional binary optionalField (UTF8);
     required binary requiredField (UTF8);
   }
   ```
   Please note that both describe the same schema: the same fields, types, and nullability.
   
   When I do:
   ```java
   sparkSession
           .read().schema(SparkSchemaUtil.convert(icebergSchema)).parquet(sourcePath)
           .write().format("iceberg").mode(SaveMode.Append).save(targetPath);
   ```
   It fails with:
   ```
   Cannot write incompatible dataset to table with schema
   table {
     1: optionalField: optional string
     2: requiredField: required string
   }
   Problems:
   * requiredField should be required, but is optional
   ```
   
   I debugged it and found that at https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/CheckCompatibility.java#L135 Iceberg checks `nullable` compatibility for each field, which is the correct logic to apply.
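
   For reference, the check boils down to the following condition (a paraphrase of the logic around that line, not the verbatim Iceberg source):
   ```java
   // Paraphrased from CheckCompatibility: a field that the table ("read") schema requires
   // must not be declared optional by the incoming write schema, otherwise an error is
   // reported for that field.
   if (readField.isRequired() && field.isOptional()) {
     errors.add(readField.name() + " should be required, but is optional");
   }
   ```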
   
   Now, at read time Spark loosens the nullability constraint by setting the `nullable` property of each field to `true`. This is done here: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L380.
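
   To make the effect concrete, here is a small illustration (hedged: it reuses the `sparkSession` and `sourcePath` names from the snippet above, and the printed output is what the behavior described here produces in Spark 2.4):
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.types.DataTypes;
   import org.apache.spark.sql.types.StructType;

   // Supplied schema: requiredField is explicitly marked non-nullable.
   StructType supplied = new StructType()
       .add("optionalField", DataTypes.StringType, true)
       .add("requiredField", DataTypes.StringType, false);

   Dataset<Row> df = sparkSession.read().schema(supplied).parquet(sourcePath);

   // Because the file source applies asNullable() to the user-specified schema,
   // the nullability information is gone by the time the DataFrame is written:
   df.schema().printTreeString();
   // root
   //  |-- optionalField: string (nullable = true)
   //  |-- requiredField: string (nullable = true)
   ```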
   
   From what I've found, there are two Spark PRs that tried to address this:
   
   - https://github.com/apache/spark/pull/14124
   - https://github.com/apache/spark/pull/17293
   
   In the first one I found this interesting mailing-list thread, [Re: Spark SQL: Preserving Dataframe Schema](https://www.mail-archive.com/user@spark.apache.org/msg39235.html), which says:
   
   > As an academic aside, note that all datatypes are nullable according to the
   > SQL Standard.
   
   In the second one I found the [following comment](https://github.com/apache/spark/pull/17293#issuecomment-482328688) as a conclusion:
   
   > Does the closure of this PR imply that setting nullable=false in a custom (user-defined) schema will never have an effect when loading CSV or JSON data from a file? In other words, if someone sets nullable: false in a custom JSON schema, as in the following scenario, it will be ignored?
   > ```
   > val customSchemaJson = <...some custom JSON schema...>
   > val customSchema = DataType.fromJson(customSchemaJson).asInstanceOf[StructType]
   > spark.read.schema(customSchema).json("/path/to/data.json")
   > ```
   
   With the response:
   
   > Yes, for now. That's what happens now.
   
   With all this, it is practically impossible to directly read a Parquet file and write its rows to an existing Iceberg dataset that has a required field, even though the schemas are the same.
   
   The main use case is directly **updating an Iceberg dataset from a Parquet file**.
   
   Here is a test I put together to demonstrate the issue: https://gist.github.com/andrei-ionescu/e2d4918264b4acea1599ba2516d2f125.
   
   The **proposal**, given that Spark is unlikely to change this behavior soon, is for **Iceberg to loosen up by not enforcing the nullability properties of the fields**, i.e. the `readField.isRequired() && field.isOptional()` condition.
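
   A minimal sketch of what that relaxation could look like, assuming a hypothetical `checkNullability` switch (not existing Iceberg API) wrapped around the condition quoted above:
   ```java
   // Hypothetical sketch only: gate the required-vs-optional error behind a flag so that
   // writers fed by Spark's loosened schemas can opt out of the nullability check.
   // `checkNullability` is an assumed option, not part of the current Iceberg API.
   if (checkNullability && readField.isRequired() && field.isOptional()) {
     errors.add(readField.name() + " should be required, but is optional");
   }
   ```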
   
   
   
   
