You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/01/21 21:00:15 UTC

[GitHub] [incubator-iceberg] ravichinoy opened a new pull request #745: schema evolution support

ravichinoy opened a new pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745
 
 
   1. skip check ordering
   2. use input schema to build accessors

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r370432184
 
 

 ##########
 File path: api/src/main/java/org/apache/iceberg/types/CheckCompatibility.java
 ##########
 @@ -78,7 +78,7 @@
 
   private CheckCompatibility(Schema schema, boolean checkOrdering, boolean checkNullability) {
     this.schema = schema;
-    this.checkOrdering = checkOrdering;
+    this.checkOrdering = false;
 
 Review comment:
   I think this should still pass `checkOrdering` correctly. To fix the problem this is trying to address, I think this should add a write option to that gets passed into this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue merged pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r370432336
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/PartitionKey.java
 ##########
 @@ -48,17 +48,16 @@
   private final Accessor<InternalRow>[] accessors;
 
   @SuppressWarnings("unchecked")
-  PartitionKey(PartitionSpec spec) {
+  PartitionKey(PartitionSpec spec, Schema inputSchema) {
     this.spec = spec;
-
 
 Review comment:
   Can you add back these newlines? We try to avoid non-functional changes like this because they can cause commit conflicts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-584954078
 
 
   @ravichinoy, thanks for working on this. It looks good.
   
   Can you either fix the methods that use a boxed Boolean or implement the builder-like API I suggested? I can add the builder in a follow-up if you want to get this in more quickly.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-585969162
 
 
   Thanks for updating this. I merged it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r377986573
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java
 ##########
 @@ -189,12 +189,13 @@ private static void mergeIcebergHadoopConfs(
         .forEach(key -> baseConf.set(key.replaceFirst("hadoop.", ""), options.get(key)));
   }
 
-  private void validateWriteSchema(Schema tableSchema, Schema dsSchema, Boolean checkNullability) {
+  private void validateWriteSchema(
+          Schema tableSchema, Schema dsSchema, Boolean checkNullability, Boolean checkOrdering) {
 
 Review comment:
   It looks like this is getting too complicated. We now have 2 boolean options, one of which is used to select the compatibility checking method (which have similar names) and the other is passed as an arg (which isn't readable). I think it is a good idea to convert compatibility checking to a builder-like pattern:
   
   ```java
   CheckCompatibility
       .writeSchema(dsSchema)
       .readSchema(tableSchema)
       .checkOrdering(true)
       .checkNullability(false)
       .throwOnValidationError();
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-577946018
 
 
   > Do you also have correctness concerns about dropping the ordering check?
   
   I don't have correctness concerns. Handling field reordering is a requirement, and Spark recently added DDL statements to reorder fields.
   
   Mainly, I wonder if it is a good idea to write data files with a column order that depends on the incoming query instead of the table's current column order. Reordering columns can change performance in Avro and it may be surprising when people look at a file in the table and see columns in an unexpected order. These aren't correctness blockers, but if this is going to be moot in the next Spark release then my thinking was that we don't need to choose what the behavior _should_ be.
   
   > Do you have any concerns with the second half of @ravichinoy's change (i.e. switching from using the Iceberg table schema to the Spark schema when building the PartitionKey accessor)?
   
   No. And now that you mentioned your concerns, I think that's probably a good idea to use the incoming schema.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-577951622
 
 
   @ravichinoy & @davrmac, I think we can get this in if we add a write config option that controls whether to do the order check, since we already have a boolean for it. We just need to pass it in from writes and then it's up to the user whether to require the same column order as the table. Does that work for you?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] davrmac commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
davrmac commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r372228688
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java
 ##########
 @@ -189,10 +189,11 @@ private static void mergeIcebergHadoopConfs(
         .forEach(key -> baseConf.set(key.replaceFirst("hadoop.", ""), options.get(key)));
   }
 
-  private void validateWriteSchema(Schema tableSchema, Schema dsSchema, Boolean checkNullability) {
+  private void validateWriteSchema(
+          Schema tableSchema, Schema dsSchema, Boolean checkNullability, Boolean checkOrdering) {
     List<String> errors;
     if (checkNullability) {
-      errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema);
+      errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema, checkOrdering);
     } else {
       errors = CheckCompatibility.typeCompatibilityErrors(tableSchema, dsSchema);
 
 Review comment:
   @ravichinoy checkOrdering needs to be passed here as well. You should be able to turn off both the nullability and ordering checks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r370432541
 
 

 ##########
 File path: spark/src/test/java/org/apache/iceberg/spark/source/TestPartitionValues.java
 ##########
 @@ -132,7 +132,7 @@ public void testNullPartitionValue() throws Exception {
 
     try {
       // TODO: incoming columns must be ordered according to the table's schema
 
 Review comment:
   We should be able to remove this TODO.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r377985010
 
 

 ##########
 File path: api/src/main/java/org/apache/iceberg/types/CheckCompatibility.java
 ##########
 @@ -39,7 +39,36 @@
    * @return a list of error details, or an empty list if there are no compatibility problems
    */
   public static List<String> writeCompatibilityErrors(Schema readSchema, Schema writeSchema) {
-    return TypeUtil.visit(readSchema, new CheckCompatibility(writeSchema, true, true));
+    return writeCompatibilityErrors(readSchema, writeSchema, true);
+  }
+
+  /**
+   * Returns a list of compatibility errors for writing with the given write schema.
+   * This includes nullability: writing optional (nullable) values to a required field is an error
+   * Optionally this method allows case where input schema has different ordering than table schema.
+   * @param readSchema a read schema
+   * @param writeSchema a write schema
+   * @param checkOrdering If false, allow input schema to have different ordering than table schema
+   * @return a list of error details, or an empty list if there are no compatibility problems
+   */
+  public static List<String> writeCompatibilityErrors(Schema readSchema, Schema writeSchema, Boolean checkOrdering) {
 
 Review comment:
   Why is this a boxed boolean?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-577428854
 
 
   From #741, it looks like the problem this is trying to address is that you can't write to a table with a different column order than the order of the table schema.
   
   The reason for this restriction is that Spark should be responsible for reconciling table columns with the data from a query. Spark has two different modes for doing this: by position for SQL, and by name for DataFrame writes. I think that delegating this to Spark is the right long-term solution.
   
   In the short term, Spark 2.4 has no resolution step for v2 writes. That's why Iceberg has the current checks that it does, so that you don't corrupt a table by writing the wrong thing. I'd rather not extend the current checks if we don't need to, given that this will be handled by Spark in the next version. Is that reasonable?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r370431898
 
 

 ##########
 File path: api/src/test/java/org/apache/iceberg/types/TestReadabilityChecks.java
 ##########
 @@ -331,26 +331,6 @@ public void testIncompatibleListAndPrimitive() {
         errors.get(0).contains("list cannot be read as a string"));
   }
 
-  @Test
-  public void testStructWriteReordering() {
 
 Review comment:
   This test is still valid when `checkOrdering` is `true`. Can you restore it?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#discussion_r370432465
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Writer.java
 ##########
 @@ -491,10 +491,10 @@ public void write(InternalRow row) throws IOException {
         AppenderFactory<InternalRow> appenderFactory,
         WriterFactory.OutputFileFactory fileFactory,
         FileIO fileIo,
-        long targetFileSize) {
+        long targetFileSize,
+        Schema schema) {
 
 Review comment:
   Can we call this `writeSchema` or some name that indicates it is the schema of the incoming data, not necessarily the table schema?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] davrmac commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
davrmac commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-577872673
 
 
   Hi @rdblue, I'm working with @ravichinoy on this issue and had a couple of follow up questions:
   
   1) For the compatibility check change is your concern just about not updating a code-path that's about to be deprecated, or do you also have correctness concerns about dropping the ordering check? From my understanding of the Spark-to-Iceberg write path, Spark has already bound field names to all of the columns being written by the time it calls the IcebergSource to create a writer, and Iceberg uses those field names (and not the field order) to perform the mapping back to the corresponding fieldIds in the Iceberg schema (which it then uses to perform the rest of its compatibility checks). Are there other parts of the Iceberg write path that still depend on field order that would cause corruptions if the ordering check wasn't enforced?
   
   2) Do you have any concerns with the second half of @ravichinoy's change (i.e. switching from using the Iceberg table schema to the Spark schema when building the PartitionKey accessor)? It's less relevant if we're still enforcing that the Spark schema match the field order of the Iceberg table schema, but I believe there are still scenarios where, if the Spark schema is missing optional fields that are defined in the Iceberg table schema, the write can pass the compatibility check but build a PartitionKey accessor that fails to pull the correct fields out of each InternalRow (we can try writing a test case to prove whether or not this is actually an issue).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] ravichinoy commented on issue #745: schema evolution support

Posted by GitBox <gi...@apache.org>.
ravichinoy commented on issue #745: schema evolution support
URL: https://github.com/apache/incubator-iceberg/pull/745#issuecomment-585969263
 
 
   thanks @rdblue we will try to implement the Builder pattern as part of follow up PR.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org