You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/20 12:53:38 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

ajantha-bhat opened a new pull request, #6461:
URL: https://github.com/apache/iceberg/pull/6461

   Currently, `sort-order-id` is only persisted in `TableMetadata` when data is written from Spark.
   But [as per spec](https://iceberg.apache.org/spec/#manifests), it has to be stored in `data_file` of `manifest_entry` also. 
   
   This issue was initially observed from `_files` metadata table. After debugging found that metadata table code is fine but the writer itself not writing it to manifest. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1054051036


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -654,6 +654,7 @@ public DataWriter<InternalRow> createWriter(int partitionId, long taskId, long e
               .dataFileFormat(format)
               .dataSchema(writeSchema)
               .dataSparkType(dsSchema)
+              .dataSortOrder(table.sortOrder())

Review Comment:
   I have updated it to handle "write.distribution-mode" = None case.
   
   I also observed that the testcase I added is using `range` distribution-mode.
   
   I didn't find whether `hash` distribution-mode will use sortkey. Because table property documentation doesn't mention about it. 
   
   > Defines distribution of write data: none: don’t shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder
   
   Meanwhile, I am trying to read about this. Please let me know if you know of any other cases where data won't be sorted when the sort order is configured. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053496008


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -654,6 +654,7 @@ public DataWriter<InternalRow> createWriter(int partitionId, long taskId, long e
               .dataFileFormat(format)
               .dataSchema(writeSchema)
               .dataSparkType(dsSchema)
+              .dataSortOrder(table.sortOrder())

Review Comment:
   "write.distribution-mode" is set to None would be one way



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053287377


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRequiredDistributionAndOrdering.java:
##########
@@ -298,4 +300,33 @@ public void testRangeDistributionWithQuotedColumnNames() throws NoSuchTableExcep
         ImmutableList.of(row(7L)),
         sql("SELECT count(*) FROM %s", tableName));
   }
+
+  @Test
+  public void testSortOrderIdInV1Manifests() {
+    testSortOrder("1");
+  }
+
+  @Test
+  public void testSortOrderIdInV2Manifests() {
+    testSortOrder("2");
+  }
+
+  private void testSortOrder(String formatVersion) {
+    sql(
+        "CREATE TABLE %s (c1 INT, c2 STRING, c3 STRING) USING iceberg TBLPROPERTIES('format-version' = '%s')",
+        tableName, formatVersion);
+    Table table = validationCatalog.loadTable(tableIdent);
+    Assert.assertEquals("Table should be unsorted", 0, table.sortOrder().orderId());
+
+    sql("ALTER TABLE %s WRITE ORDERED BY c2", tableName);
+    table.refresh();
+    Assert.assertEquals("Table sort order id should be 1", 1, table.sortOrder().orderId());
+
+    sql("INSERT INTO %s SELECT 43, 'cc', 'dd'", tableName);
+    // check the contents from the manifest files
+    assertEquals(
+        "Sort order in manifest entry should match the table sort order",
+        ImmutableList.of(row(1)),

Review Comment:
   This was always returning 0 as the `manifest file` was never storing `sort_order_id` from Spark. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382677828

> I'm still a little worried about saving this information, what does knowing the sort order mean for a file. Are we guaranteeing that the file is locally sorted by that order? or globally sorted with files added in that snapshot. I'm cautious about doing anything here until we really define what it means.

**Spec says this. I am just trying to make the implementation the same as the spec. I am ok with changing the spec or implementation.**

> `140 sort_order_id` - ID representing sort order for this file [3].
If the sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](https://iceberg.apache.org/spec/#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.

cc: @rdblue

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053445334


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRequiredDistributionAndOrdering.java:
##########
@@ -298,4 +300,33 @@ public void testRangeDistributionWithQuotedColumnNames() throws NoSuchTableExcep
         ImmutableList.of(row(7L)),
         sql("SELECT count(*) FROM %s", tableName));
   }
+
+  @Test

Review Comment:
   Can you please elaborate on how to do that? I am not very familiar with that.
   
   I was thinking that just mentioning the corresponding table sort order id in the manifest info is enough for time travel queries. 
   Also, I am wondering how the file scan task currently working without the sort order id (in case of sort order evolution) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.4: Store sort-order-id in manifest_entry's data_file

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1194885610


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -620,20 +626,23 @@ private static class WriterFactory implements DataWriterFactory, StreamingDataWr
     private final Schema writeSchema;
     private final StructType dsSchema;
     private final boolean partitionedFanoutEnabled;
+    private final boolean isOrderedWrite;

Review Comment:
   This `isOrderedWrite` is decided based on `requiredOrdering` field of `SparkWrite` which is an Array of Spark SortOrder. I don't know how to convert array of spark sort order to an Iceberg sort order (didn't find an existing code). I guess only Iceberg sort order has the sort order ID.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1379088280

   @RussellSpitzer: After checking some code. I found that callers has a logic to find the required ordering (by checking distribution table properties for merge/write/update/delete). So, I think I can use that.
   
   Did some manual testing. It looks ok to me. 
   So, please let me know if this change is ok. If yes, I can add all combinations of test cases.   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053285802


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRequiredDistributionAndOrdering.java:
##########
@@ -298,4 +300,33 @@ public void testRangeDistributionWithQuotedColumnNames() throws NoSuchTableExcep
         ImmutableList.of(row(7L)),
         sql("SELECT count(*) FROM %s", tableName));
   }
+
+  @Test

Review Comment:
   wanted to add a `parameterized` test. But the parent class (SparkCatalogTestBase) itself is `parameterized`. So, it cannot be nested again for specific testcase I guess. 
   
   hence added as two separate test case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053475509


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -654,6 +654,7 @@ public DataWriter<InternalRow> createWriter(int partitionId, long taskId, long e
               .dataFileFormat(format)
               .dataSchema(writeSchema)
               .dataSparkType(dsSchema)
+              .dataSortOrder(table.sortOrder())

Review Comment:
   Say I have configured sort order and I do INSERT into, can you please explain when/why the data file will not be sorted in this case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1382374860

   I'm still a little worried about saving this information, what does knowing the sort order mean for a file. Are we guaranteeing that the file is locally sorted by that order? or globally sorted with files added in that snapshot. I'm cautious about doing anything here until we really define what it means.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1359315864

   cc: @RussellSpitzer, @aokolnychyi, @rdblue   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1053411922


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRequiredDistributionAndOrdering.java:
##########
@@ -298,4 +300,33 @@ public void testRangeDistributionWithQuotedColumnNames() throws NoSuchTableExcep
         ImmutableList.of(row(7L)),
         sql("SELECT count(*) FROM %s", tableName));
   }
+
+  @Test

Review Comment:
   I think the bigger problem here is that we do not have a guarantee that this file was actually sorted according to the spec. We'll need to check distributions configuration and make sure the Spark writer is actually writing sorted data.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -654,6 +654,7 @@ public DataWriter<InternalRow> createWriter(int partitionId, long taskId, long e
               .dataFileFormat(format)
               .dataSchema(writeSchema)
               .dataSparkType(dsSchema)
+              .dataSortOrder(table.sortOrder())

Review Comment:
   There is no guarantee that this file is actually sorted when written using the Spark Writer. I think we'll need to do a bit of extra work for this to actually be the case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#discussion_r1155370540


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java:
##########
@@ -620,20 +626,23 @@ private static class WriterFactory implements DataWriterFactory, StreamingDataWr
     private final Schema writeSchema;
     private final StructType dsSchema;
     private final boolean partitionedFanoutEnabled;
+    private final boolean isOrderedWrite;

Review Comment:
   @ajantha-bhat, I think this should pass the sort order ID that was used instead. There isn't a guarantee that the write used the default sort order. It could be possible to choose the sort order in the future, and some of the actions already allow you to choose.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org