You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/30 06:07:13 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request, #4898: Core: Add schema_id to ContentFile/ManifestFile

ConeyLiu opened a new pull request, #4898:
URL: https://github.com/apache/iceberg/pull/4898

   This is the first part of #4842. Add the schema id to DataFile/DeteFile/ManifestFile and which could be used to evaluate the filter expression based on the schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1140736564

   Hi @rdblue @kbendick, could you help to review this when you are free? Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885276415


##########
core/src/main/java/org/apache/iceberg/V2Metadata.java:
##########
@@ -47,7 +47,9 @@ private V2Metadata() {
       ManifestFile.ADDED_ROWS_COUNT.asRequired(),
       ManifestFile.EXISTING_ROWS_COUNT.asRequired(),
       ManifestFile.DELETED_ROWS_COUNT.asRequired(),
-      ManifestFile.PARTITION_SUMMARIES
+      ManifestFile.PARTITION_SUMMARIES,
+      ManifestFile.KEY_METADATA,

Review Comment:
   Removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r909543308


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";
+    long fileSize = 150972L;
+    FileFormat format = FileFormat.AVRO;
+
+    Metrics metrics = new Metrics(
+        1587L,
+        ImmutableMap.of(1, 15L, 2, 122L, 3, 4021L, 4, 9411L, 5, 15L), // sizes
+        ImmutableMap.of(1, 100L, 2, 100L, 3, 100L, 4, 100L, 5, 100L),  // value counts
+        ImmutableMap.of(1, 0L, 2, 0L, 3, 0L, 4, 0L, 5, 0L), // null value counts
+        ImmutableMap.of(5, 10L), // nan value counts
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1)),  // lower bounds
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
+    Integer sortOrderId = 2;
+
+    String fileName = String.format("OldManifestFileV%s.avro", formatVersion);

Review Comment:
   The `OldManifestFileV1.avro/OldManifestFileV2.avro/` is the previously DataFile spec(the file instance is https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java#L74). Hi, @szehon-ho I think this test covered your concern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r911652013


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";
+    long fileSize = 150972L;
+    FileFormat format = FileFormat.AVRO;
+
+    Metrics metrics = new Metrics(
+        1587L,
+        ImmutableMap.of(1, 15L, 2, 122L, 3, 4021L, 4, 9411L, 5, 15L), // sizes
+        ImmutableMap.of(1, 100L, 2, 100L, 3, 100L, 4, 100L, 5, 100L),  // value counts
+        ImmutableMap.of(1, 0L, 2, 0L, 3, 0L, 4, 0L, 5, 0L), // null value counts
+        ImmutableMap.of(5, 10L), // nan value counts
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1)),  // lower bounds
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
+    Integer sortOrderId = 2;
+
+    String fileName = String.format("OldManifestFileV%s.avro", formatVersion);

Review Comment:
   I am not sure I have fully understood your ways. In this pr, we not only add a new field to the `DataFile` and also changed the return StructType of `DataFile.getType`. Shouldn't we need to customize DataFile/V1Metadata/V2Metadata to write the DataFile with old data spec? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1642172414

   Hi @rdblue @szehon-ho @aokolnychyi do you have any time to look at this again? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1141974124

   Hi @rdblue, thanks for the review. I have removed the changes for the writers. Please take another look, thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r894573714


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r941357092


##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -54,6 +54,7 @@ public PartitionData copy() {
   private Types.StructType partitionType;
 
   private Long fileOrdinal = null;
+  private int schemaId = -1;

Review Comment:
   Updated to use Integer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1210127293

   > I think the situation would be the same even in your proposal to add new schemaid field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec , in order for your initial proposed optimization to be accurate? Because there may be data in the new file that was written by a later schema.
   
   You are correct. The data file with the new spec after rewrite. We can not benefit from the schema evaluation because we lost the original schema information.
   
   > As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests. 
   
   In RewriteManifest, we use the current table partition [spec](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L103) or the specified spec with spec ID. I think the schema used in the current space is not the same as the original schema for the old manifest file. That's because we will [rewrite the partition spec](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L952) when updating the table schema. Please correct me if I am wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r884938667


##########
core/src/main/java/org/apache/iceberg/avro/Avro.java:
##########
@@ -338,13 +340,19 @@ public DataWriteBuilder withSortOrder(SortOrder newSortOrder) {
       return this;
     }
 
+    public DataWriteBuilder withSchemaId(int newSchemaId) {
+      this.schemaId = newSchemaId;

Review Comment:
   Why isn't this taken from the schema that is passed in?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1268152345


##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -98,18 +98,20 @@ public interface DataFile extends ContentFile<DataFile> {
   Types.NestedField SORT_ORDER_ID =
       optional(140, "sort_order_id", IntegerType.get(), "Sort order ID");
   Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID");
+  Types.NestedField SCHEMA_ID = optional(142, "schema_id", IntegerType.get(), "Schema ID");
 
   int PARTITION_ID = 102;
   String PARTITION_NAME = "partition";
   String PARTITION_DOC = "Partition data tuple, schema based on the partition spec";
-  // NEXT ID TO ASSIGN: 142
+  // NEXT ID TO ASSIGN: 143
 
   static StructType getType(StructType partitionType) {
     // IDs start at 100 to leave room for changes to ManifestEntry
     return StructType.of(
         CONTENT,
         FILE_PATH,
         FILE_FORMAT,
+        SCHEMA_ID,

Review Comment:
   @szehon-ho if you are concerned about this order. We may put the `SCHEMA_ID` to last to align other new fields. Then we don't need to update too many methods of `get/put` in `BaseFile`.  And some of the spark/flink UTs are not needed to update as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r894928668


##########
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -81,6 +82,7 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     keyMetadataPosition = positions.get("key_metadata");
     splitOffsetsPosition = positions.get("split_offsets");
     sortOrderIdPosition = positions.get("sort_order_id");
+    schemaIdPosition = positions.get("schema_id");

Review Comment:
   Nit: should we put it in the consistent position, after file format?



##########
format/spec.md:
##########
@@ -441,6 +441,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`132  split_offsets`**          | `list<133: long>`            | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
 |            | _optional_ | **`135  equality_ids`**           | `list<136: int>`             | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
 | _optional_ | _optional_ | **`140  sort_order_id`**          | `int`                        | ID representing sort order for this file [3]. |
+| _optional_ | _optional_ | **`141  spec_id`**                | `int`                        | ID representing Partition Spec for this file |

Review Comment:
   Spec id is actually a generated field, its not persisted in spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r884936249


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   Instead of returning an incorrect value, I think this should throw an exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1141285784

   @ConeyLiu, thanks for breaking this up, but this is still far too large for a single PR. There's no need to update all of the writers in a single PR. Instead, this should focus on core and API classes and just read null values for the new field. You can add writes later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1268155766


##########
core/src/test/resources/OldManifestFileV1.avro:
##########


Review Comment:
   These two manifest files are wroten with the old spec. They are used to test use spec reader reading the files wroten with old spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1185706910


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";
+    long fileSize = 150972L;
+    FileFormat format = FileFormat.AVRO;
+
+    Metrics metrics = new Metrics(
+        1587L,
+        ImmutableMap.of(1, 15L, 2, 122L, 3, 4021L, 4, 9411L, 5, 15L), // sizes
+        ImmutableMap.of(1, 100L, 2, 100L, 3, 100L, 4, 100L, 5, 100L),  // value counts
+        ImmutableMap.of(1, 0L, 2, 0L, 3, 0L, 4, 0L, 5, 0L), // null value counts
+        ImmutableMap.of(5, 10L), // nan value counts
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1)),  // lower bounds
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
+    Integer sortOrderId = 2;
+
+    String fileName = String.format("OldManifestFileV%s.avro", formatVersion);

Review Comment:
   Yea I was thinking to make a Test version of those writers..  not sure if its possible.  Anyway, I guess this works too, only issue is wont be too debuggable if something goes wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1536947668

   Thanks @szehon-ho @aokolnychyi
   
   > Will this mean all evaluator logic will have to change to be schema specific? Is there a simple example how this will be consumed?
   
   We don't need to change those existing evaluators. Just need to create a `SchemaEvaluator` by the filter expression and schema, and use it to evaluate the file by its schema.. You can refer [here](https://github.com/apache/iceberg/pull/4842/files#diff-79baa98562948784b4c255ee2dd14ae3a26f875d246f1607f76d58668285ef31R206):
   ```java
   return CloseableIterable.filter(
         open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
         entry -> {
           boolean keep = entry != null;
           if (keep && schemaEvaluator != null && entry.file().schemaId() > -1) {
             // evaluate based on the schemaId
             keep = schemaEvaluator.eval(schemasById.get(entry.file().schemaId()));
           }
   
          return keep &&
               evaluator.eval(entry.file().partition()) &&
               metricsEvaluator.eval(entry.file()) &&
               inPartitionSet(entry.file());
          });
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885485188


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   I tried with throw an exception and then we need to add more `try catch` to wrap the function call. I keep this with -1 and add comments for the -1 means.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885259052


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   For newly added files, those files should have the same schema ID. I think it should be a common situation that manifest entries have the same schema ID. Such as batch write/overwrite case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r911642975


##########
core/src/main/java/org/apache/iceberg/deletes/EqualityDeleteWriter.java:
##########
@@ -42,18 +42,26 @@
   private final ByteBuffer keyMetadata;
   private final int[] equalityFieldIds;
   private final SortOrder sortOrder;
+  private final int schemaId;
   private DeleteFile deleteFile = null;
 
   public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
                               PartitionSpec spec, StructLike partition, EncryptionKeyMetadata keyMetadata,
                               SortOrder sortOrder, int... equalityFieldIds) {
+    this(appender, format, location, spec, partition, keyMetadata, sortOrder, -1, equalityFieldIds);

Review Comment:
   This is a public class and constructor, I think we should keep the compatibility. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r884936678


##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -88,7 +89,8 @@ static StructType getType(StructType partitionType) {
         KEY_METADATA,
         SPLIT_OFFSETS,
         EQUALITY_IDS,
-        SORT_ORDER_ID
+        SORT_ORDER_ID,
+        SCHEMA_ID

Review Comment:
   I think this should be added just above `SPEC_ID`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1708187248

   Hi @rdblue @szehon-ho @aokolnychyi @RussellSpitzer @nastra, could you help to review this? This should be useful for tables with frequent column additions or deletions. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1268152345


##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -98,18 +98,20 @@ public interface DataFile extends ContentFile<DataFile> {
   Types.NestedField SORT_ORDER_ID =
       optional(140, "sort_order_id", IntegerType.get(), "Sort order ID");
   Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID");
+  Types.NestedField SCHEMA_ID = optional(142, "schema_id", IntegerType.get(), "Schema ID");
 
   int PARTITION_ID = 102;
   String PARTITION_NAME = "partition";
   String PARTITION_DOC = "Partition data tuple, schema based on the partition spec";
-  // NEXT ID TO ASSIGN: 142
+  // NEXT ID TO ASSIGN: 143
 
   static StructType getType(StructType partitionType) {
     // IDs start at 100 to leave room for changes to ManifestEntry
     return StructType.of(
         CONTENT,
         FILE_PATH,
         FILE_FORMAT,
+        SCHEMA_ID,

Review Comment:
   @szehon-ho if you are concerned about this order. We may put the `SCHEMA_ID` to last to align other new fields. Then we don't need to update too many methods of `get/put` in `BaseFile`.  And some of the spark UTs are not needed to update as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r890629572


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   Cant we use Preconditions, or just throw a runtimeException?



##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   Agree with @rdblue , I think having a datafile filtering is good for the first cut, as having manifest file keep schema id seems hard to get right and maintain, especially as any bug here is a bit risky and tricky, so definitely prefer simple approach first.



##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -63,18 +63,20 @@ public interface DataFile extends ContentFile<DataFile> {
       "Equality comparison field IDs");
   Types.NestedField SORT_ORDER_ID = optional(140, "sort_order_id", IntegerType.get(), "Sort order ID");
   Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID");
+  Types.NestedField SCHEMA_ID = optional(142, "schema_id", IntegerType.get(), "Schema ID");

Review Comment:
   Note: it may need a follow up note in spec that 142 field is reserved, like: https://github.com/apache/iceberg/pull/4750



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1169897759

   Hi @rdblue @szehon-ho, I am sorry for the late update. The compatible test has been added. Hopeful, you could take another look when you are free.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r941355938


##########
core/src/main/java/org/apache/iceberg/deletes/EqualityDeleteWriter.java:
##########
@@ -42,18 +42,26 @@
   private final ByteBuffer keyMetadata;
   private final int[] equalityFieldIds;
   private final SortOrder sortOrder;
+  private final int schemaId;
   private DeleteFile deleteFile = null;
 
   public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
                               PartitionSpec spec, StructLike partition, EncryptionKeyMetadata keyMetadata,
                               SortOrder sortOrder, int... equalityFieldIds) {
+    this(appender, format, location, spec, partition, keyMetadata, sortOrder, -1, equalityFieldIds);

Review Comment:
   Should we do it in this patch or a separate patch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r940733614


##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -54,6 +54,7 @@ public PartitionData copy() {
   private Types.StructType partitionType;
 
   private Long fileOrdinal = null;
+  private int schemaId = -1;

Review Comment:
   I think partitionSpecId is different because that's a derived field. Slightly prefer Integer more reflects what's persisted on disk.



##########
core/src/main/java/org/apache/iceberg/deletes/EqualityDeleteWriter.java:
##########
@@ -42,18 +42,26 @@
   private final ByteBuffer keyMetadata;
   private final int[] equalityFieldIds;
   private final SortOrder sortOrder;
+  private final int schemaId;
   private DeleteFile deleteFile = null;
 
   public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
                               PartitionSpec spec, StructLike partition, EncryptionKeyMetadata keyMetadata,
                               SortOrder sortOrder, int... equalityFieldIds) {
+    this(appender, format, location, spec, partition, keyMetadata, sortOrder, -1, equalityFieldIds);

Review Comment:
   I'm not a huge fan of it, I think we can have a builder like SparkAppenderFactory (where we did some refactor to not have one constructor per new argument)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885259052


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   ~~ For newly added files, those files should have the same schema ID. I think it should be a common situation that manifest entries have the same schema ID. Such as batch write/overwrite case.  ~~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r894580150


##########
core/src/main/java/org/apache/iceberg/avro/Avro.java:
##########
@@ -338,13 +340,19 @@ public DataWriteBuilder withSortOrder(SortOrder newSortOrder) {
       return this;
     }
 
+    public DataWriteBuilder withSchemaId(int newSchemaId) {
+      this.schemaId = newSchemaId;

Review Comment:
   You could find it in the following code. The `new Schema(struct.fields())` will create the schema with the default schema ID 0.
   
   ```java
     Schema writeSchema = validateOrMergeWriteSchema(table, dsSchema, writeConf);
   
     // validateOrMergeWriteSchema
     private static Schema validateOrMergeWriteSchema(Table table, StructType dsSchema, SparkWriteConf writeConf) {
       Schema writeSchema;
       if (writeConf.mergeSchema()) {
         ...
       } else {
         writeSchema = SparkSchemaUtil.convert(table.schema(), dsSchema);
         TypeUtil.validateWriteSchema(
             table.schema(), writeSchema, writeConf.checkNullability(), writeConf.checkOrdering());
       }
   
       return writeSchema;
     }
   
     // SparkSchemaUtil.convert
     public static Schema convert(Schema baseSchema, StructType sparkType) {
       // convert to a type with fresh ids
       Types.StructType struct = SparkTypeVisitor.visit(sparkType, new SparkTypeToType(sparkType)).asStructType();
       // reassign ids to match the base schema
       Schema schema = TypeUtil.reassignIds(new Schema(struct.fields()), baseSchema);
       // fix types that can't be represented in Spark (UUID and Fixed)
       return SparkFixupTypes.fixup(schema, baseSchema);
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1152409177

   Thanks, @rdblue @szehon-ho for the review. Comments have been addressed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1156396262

   Thanks @szehon-ho for the review and suggestion.
   
   > Maybe a test writer that creates metadata files with all optional columns as null? That way can test all the new columns at once.
   
   I will add the test later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1211065634

   Yea you are right, it seems it will set the latest schema of each spec on the rewritten manifests, so the information is lost if you evolve schemas within a spec.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "chenjunjiedada (via GitHub)" <gi...@apache.org>.
chenjunjiedada commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1534044111

   @szehon-ho,@rdblue  Do we have any agreement here? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1317153091


##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -123,7 +124,8 @@ static StructType getType(StructType partitionType) {
         KEY_METADATA,
         SPLIT_OFFSETS,
         EQUALITY_IDS,
-        SORT_ORDER_ID);
+        SORT_ORDER_ID,
+        SCHEMA_ID);

Review Comment:
   Put it in the last one to reduce the code changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r884938128


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   What is the value of adding the schema ID to a manifest file? Manifests can contain files with different schemas and I don't think there is a way to filter manifests based on the schema of the data files it may contain. At best, you could fill this in if all of the files in the manifest had the same schema ID, but that's really unlikely. I'm not sure it would be worth the code and the metadata.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885259052


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   For newly added files, those files should have the same schema ID. I think it should be a common situation that manifest entries have the same schema ID. Such as batch write/overwrite case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885485188


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   I tried with throw an exception and then we need to add more `try catch` the wrap the function call. I keep this with -1 and add comments for the -1 means.



##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -88,7 +89,8 @@ static StructType getType(StructType partitionType) {
         KEY_METADATA,
         SPLIT_OFFSETS,
         EQUALITY_IDS,
-        SORT_ORDER_ID
+        SORT_ORDER_ID,
+        SCHEMA_ID

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r891766960


##########
api/src/main/java/org/apache/iceberg/ContentFile.java:
##########
@@ -39,6 +39,13 @@
    */
   int specId();
 
+  /**
+   * Return id of the schema when write this file.
+   */
+  default int schemaId() {
+    return -1;

Review Comment:
   None of the implementations should actually use this implementation. The `default` is needed for binary compatibility, but if anyone calls this it should throw an exception.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1152930373

   > I wonder, is it possible to add a test to try to deserialize an older manifest entry without schema_id?
   
   It seems to need to implement customized V1Writer/V2Writer and those `Indexed` classes. Is there any suggestion for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r894574178


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -64,14 +64,15 @@ public interface ManifestFile {
       "Summary for each partition");
   Types.NestedField KEY_METADATA = optional(519, "key_metadata", Types.BinaryType.get(),
       "Encryption key metadata blob");
-  // next ID to assign: 520
+  Types.NestedField SCHEMA_ID = optional(520, "schema_id", Types.IntegerType.get(), "Schema ID");

Review Comment:
   Removed the schema ID for manifest file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1209652918

   @ConeyLiu that's a good question, I think (may be wrong) rewriteDataFiles groups files by partition/partition spec, and may not preserve the old schemas.  Ie, all the data files are rewritten with latest schema of that partition spec.  
   
   I think the situation would be the same even in your proposal to add new schemaid field to data_file, right?  After rewriteDataFiles we have to carry over the latest schema-id of each spec , in order for your initial proposed optimization to be accurate?  Because there may be data in the new file that was written by a later schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r884939499


##########
core/src/main/java/org/apache/iceberg/V2Metadata.java:
##########
@@ -47,7 +47,9 @@ private V2Metadata() {
       ManifestFile.ADDED_ROWS_COUNT.asRequired(),
       ManifestFile.EXISTING_ROWS_COUNT.asRequired(),
       ManifestFile.DELETED_ROWS_COUNT.asRequired(),
-      ManifestFile.PARTITION_SUMMARIES
+      ManifestFile.PARTITION_SUMMARIES,
+      ManifestFile.KEY_METADATA,

Review Comment:
   Why does this add key metadata?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885261028


##########
core/src/main/java/org/apache/iceberg/avro/Avro.java:
##########
@@ -338,13 +340,19 @@ public DataWriteBuilder withSortOrder(SortOrder newSortOrder) {
       return this;
     }
 
+    public DataWriteBuilder withSchemaId(int newSchemaId) {
+      this.schemaId = newSchemaId;

Review Comment:
   I see the schema passed into the `Avro`/`Parquet`/`ORC` is the one converted from Spark/Flink schema. And that schema has the default ID `-1`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1151094879

   A little busy recently. Will address the comments tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r891766123


##########
core/src/main/java/org/apache/iceberg/avro/Avro.java:
##########
@@ -338,13 +340,19 @@ public DataWriteBuilder withSortOrder(SortOrder newSortOrder) {
       return this;
     }
 
+    public DataWriteBuilder withSchemaId(int newSchemaId) {
+      this.schemaId = newSchemaId;

Review Comment:
   Where is this happening?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r911461326


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";
+    long fileSize = 150972L;
+    FileFormat format = FileFormat.AVRO;
+
+    Metrics metrics = new Metrics(
+        1587L,
+        ImmutableMap.of(1, 15L, 2, 122L, 3, 4021L, 4, 9411L, 5, 15L), // sizes
+        ImmutableMap.of(1, 100L, 2, 100L, 3, 100L, 4, 100L, 5, 100L),  // value counts
+        ImmutableMap.of(1, 0L, 2, 0L, 3, 0L, 4, 0L, 5, 0L), // null value counts
+        ImmutableMap.of(5, 10L), // nan value counts
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1)),  // lower bounds
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
+    Integer sortOrderId = 2;
+
+    String fileName = String.format("OldManifestFileV%s.avro", formatVersion);

Review Comment:
   This is really great to have this test.  I was initially thinking that we want to have a TestV1Writer / TestV2Writer that writes all the optional fields as null, instead of checking in the old version avro file.  Is that possible?   I can also take a look myself if that is possible.



##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";

Review Comment:
   Is this used?



##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -54,6 +54,7 @@ public PartitionData copy() {
   private Types.StructType partitionType;
 
   private Long fileOrdinal = null;
+  private int schemaId = -1;

Review Comment:
   In spec its optional, so trying to understand, why don't we use Integer?



##########
core/src/main/java/org/apache/iceberg/deletes/EqualityDeleteWriter.java:
##########
@@ -42,18 +42,26 @@
   private final ByteBuffer keyMetadata;
   private final int[] equalityFieldIds;
   private final SortOrder sortOrder;
+  private final int schemaId;
   private DeleteFile deleteFile = null;
 
   public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
                               PartitionSpec spec, StructLike partition, EncryptionKeyMetadata keyMetadata,
                               SortOrder sortOrder, int... equalityFieldIds) {
+    this(appender, format, location, spec, partition, keyMetadata, sortOrder, -1, equalityFieldIds);

Review Comment:
   It's not that great to have to add a new constructor to all these.  As we already have WriterFactory that abstract it and should be the ones getting called, I wonder if we can just change this interface?  cc @rdblue @aokolnychyi 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1154471992

   Yea there is some limited discussion in this related issue, but I guess no good conclusions:  https://github.com/apache/iceberg/issues/2542.  
   
   Maybe a test writer that creates metadata files with all optional columns as null?  That way can test all the new columns at once.
   
   By the way,  change mostly looks good, and it's good you put the new field is optional to avoid the issue like the one mentioned.  I just dont know enough about the serialization/deserialization code to be sure if there are any other problems with previous serialized metadata, so was hoping to have a test to verify it.  Though can leave to @rdblue to approve if he is confident , and we can tackle the backward compat tests on the side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r909543308


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";
+    long fileSize = 150972L;
+    FileFormat format = FileFormat.AVRO;
+
+    Metrics metrics = new Metrics(
+        1587L,
+        ImmutableMap.of(1, 15L, 2, 122L, 3, 4021L, 4, 9411L, 5, 15L), // sizes
+        ImmutableMap.of(1, 100L, 2, 100L, 3, 100L, 4, 100L, 5, 100L),  // value counts
+        ImmutableMap.of(1, 0L, 2, 0L, 3, 0L, 4, 0L, 5, 0L), // null value counts
+        ImmutableMap.of(5, 10L), // nan value counts
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1)),  // lower bounds
+        ImmutableMap.of(1, Conversions.toByteBuffer(Types.IntegerType.get(), 1))); // upper bounds
+    Integer sortOrderId = 2;
+
+    String fileName = String.format("OldManifestFileV%s.avro", formatVersion);

Review Comment:
   The `OldManifestFileV1.avro/OldManifestFileV2.avro/` is the previous DataFile(the file instance is https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java#L74). Hi, @szehon-ho I think this test covered your concern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile

Posted by "ConeyLiu (via GitHub)" <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r1317158883


##########
api/src/main/java/org/apache/iceberg/DataFile.java:
##########
@@ -98,18 +98,20 @@ public interface DataFile extends ContentFile<DataFile> {
   Types.NestedField SORT_ORDER_ID =
       optional(140, "sort_order_id", IntegerType.get(), "Sort order ID");
   Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID");
+  Types.NestedField SCHEMA_ID = optional(142, "schema_id", IntegerType.get(), "Schema ID");
 
   int PARTITION_ID = 102;
   String PARTITION_NAME = "partition";
   String PARTITION_DOC = "Partition data tuple, schema based on the partition spec";
-  // NEXT ID TO ASSIGN: 142
+  // NEXT ID TO ASSIGN: 143
 
   static StructType getType(StructType partitionType) {
     // IDs start at 100 to leave room for changes to ManifestEntry
     return StructType.of(
         CONTENT,
         FILE_PATH,
         FILE_FORMAT,
+        SCHEMA_ID,

Review Comment:
   Moved the last



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1209433520

   Hi @szehon-ho thanks for reviewing this again. 
   
   > Also I noticed, spec-id and schema is already written in the header of each manifest. As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests. Wondering at a high level, is it adequate for the optimization you are planning?
   
   What happened after the rewritten data file? It seems like the schema of the manifest file is the current table schema. And those manifest entries in the same manifest file could have a different schema after rewrite. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r895027853


##########
format/spec.md:
##########
@@ -441,6 +441,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`132  split_offsets`**          | `list<133: long>`            | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
 |            | _optional_ | **`135  equality_ids`**           | `list<136: int>`             | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
 | _optional_ | _optional_ | **`140  sort_order_id`**          | `int`                        | ID representing sort order for this file [3]. |
+| _optional_ | _optional_ | **`141  spec_id`**                | `int`                        | ID representing Partition Spec for this file |

Review Comment:
   Thanks for the explanation. Updated 



##########
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java:
##########
@@ -81,6 +82,7 @@ public SparkDataFile(Types.StructType type, StructType sparkType) {
     keyMetadataPosition = positions.get("key_metadata");
     splitOffsetsPosition = positions.get("split_offsets");
     sortOrderIdPosition = positions.get("sort_order_id");
+    schemaIdPosition = positions.get("schema_id");

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r911644907


##########
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##########
@@ -142,9 +147,61 @@ public void testDeleteFilePositions() throws IOException {
       long expectedPos = 0L;
       for (DeleteFile file : reader) {
         Assert.assertEquals("Position should match", (Long) expectedPos, file.pos());
-        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(17));
+        Assert.assertEquals("Position from field index should match", expectedPos, ((BaseFile) file).get(18));
         expectedPos += 1;
       }
     }
   }
+
+  @Test
+  public void testManifestCompatible() throws IOException {
+    Schema schema = new Schema(
+        required(1, "id", Types.LongType.get()),
+        required(2, "timestamp", Types.TimestampType.withZone()),
+        required(3, "category", Types.StringType.get()),
+        required(4, "data", Types.StringType.get()),
+        required(5, "double", Types.DoubleType.get()));
+
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .identity("category")
+        .hour("timestamp")
+        .bucket("id", 16)
+        .build();
+    PartitionData partition = DataFiles.data(spec, "category=cheesy/timestamp_hour=10/id_bucket=3");
+
+    String path = "s3://bucket/table/category=cheesy/timestamp_hour=10/id_bucket=3/file.avro";

Review Comment:
   There is a check: `Assert.assertEquals("File path should match", path, entry.file().path());`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r911644215


##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -54,6 +54,7 @@ public PartitionData copy() {
   private Types.StructType partitionType;
 
   private Long fileOrdinal = null;
+  private int schemaId = -1;

Review Comment:
   I am OK with either one. I followed the `partitionSpecId` behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885276415


##########
core/src/main/java/org/apache/iceberg/V2Metadata.java:
##########
@@ -47,7 +47,9 @@ private V2Metadata() {
       ManifestFile.ADDED_ROWS_COUNT.asRequired(),
       ManifestFile.EXISTING_ROWS_COUNT.asRequired(),
       ManifestFile.DELETED_ROWS_COUNT.asRequired(),
-      ManifestFile.PARTITION_SUMMARIES
+      ManifestFile.PARTITION_SUMMARIES,
+      ManifestFile.KEY_METADATA,

Review Comment:
   I am not sure this is right, please correct me. Because I need to keep the schema matching the `IndexedManifestFile.get` method. 
   
   ```java
           case 13:
             return wrapped.partitions();
           case 14:
             return wrapped.keyMetadata();
           case 15:
             return wrapped.schemaId();
           default:
             throw new UnsupportedOperationException("Unknown field ordinal: " + pos);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#discussion_r885261028


##########
core/src/main/java/org/apache/iceberg/avro/Avro.java:
##########
@@ -338,13 +340,19 @@ public DataWriteBuilder withSortOrder(SortOrder newSortOrder) {
       return this;
     }
 
+    public DataWriteBuilder withSchemaId(int newSchemaId) {
+      this.schemaId = newSchemaId;

Review Comment:
   I see the schema passed into the `Avro`/`Parquet`/`ORC` is the one converted from Spark/Flink schema. And that schema has the default ID `0`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1149243949

   > Is this something we can do or need to wait until V3?
   
   @szehon-ho, this is something we can do because it is backward compatible. Older readers will ignore new fields, so we can add it safely. And if we don't have a schema ID for a column, then we just return null and skip the optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #4898: Core: Add schema_id to ContentFile/ManifestFile

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1536686313

   Will this mean all evaluator logic will have to change to be schema specific? Is there a simple example how this will be consumed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org