You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/24 16:41:45 UTC

[GitHub] [incubator-iceberg] rdblue opened a new pull request #963: Update v2 manifests to store only DataFile (WIP)

rdblue opened a new pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963


   This includes commits from #952. Once that PR is merged, I'll remove them from this one.
   
   This merges the fields from ManifestEntry into DataFile for v2 manifests. Now, v2 manifests store DataFile that has status, snapshot id, and sequence number. This should make v2 metadata easier to work with.
   
   Other notable changes:
   * ManifestEntryWrapper has been combined with the v1 IndexedManifestEntry because both are wrapper classes specific to v1
   * For v2, an IndexedDataFile is added that is used for the same purpose as the v1 IndexedManifestEntry wrapper
   * To maintain compatibility, ManifestEntry.Status is unchanged. Instead, this adds FileStatus to iceberg-api
   * This refactor exposed an existing bug in DataFile, where `get(int pos, Class<?> javaType)` was returning values based on the position in an Avro projection instead of the position in the schema expressions bind to (see `getInternal` changes)
   * ManifestReader detects whether a manifest is v1 or v2 based on a metadata property in the Avro header; we will probably want to change this and track the version in the manifest list


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416158487



##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)

Review comment:
       Yes. This ensures that the correct schema is used to read a v2 manifest.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#issuecomment-620237053


   @rdblue, seems like this breaks our metadata tests. Could you take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416157893



##########
File path: core/src/main/java/org/apache/iceberg/GenericDataFile.java
##########
@@ -377,12 +439,15 @@ public Object get(int i) {
 
   @Override
   public int size() {
-    return 13;
+    return 15;

Review comment:
       Yes, this is what the v1 indexed wrapper is for. It can translate to the v1 format.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416161543



##########
File path: core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java
##########
@@ -129,7 +129,7 @@ protected long targetSplitSize(TableOperations ops) {
     public CloseableIterable<StructLike> rows() {
       return CloseableIterable.transform(
           ManifestFiles.read(manifest, io).project(fileSchema).allEntries(),
-          file -> (GenericManifestEntry) file);
+          GenericDataFile.AsManifestEntry.class::cast);

Review comment:
       Yes. This table continues to use ManifestEntry as a view of the data.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#issuecomment-620199453


   Let me go through this now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r414914000



##########
File path: core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
##########
@@ -136,24 +141,31 @@ public void testV2ManifestRewriteWithInheritance() throws IOException {
     checkRewrittenManifest(manifest2, SEQUENCE_NUMBER, 0L);
 
     // should not inherit the v2 sequence number because it was written into the v2 manifest
-    checkRewrittenEntry(readManifest(manifest2), 0L);
+    checkEntry(readManifest(manifest2).asEntry(), ManifestEntry.Status.EXISTING, 0L);
+    checkDataFile(readManifest(manifest2), FileStatus.EXISTING, 0L);
   }
 
-  void checkEntry(ManifestEntry entry, Long expectedSequenceNumber) {
-    Assert.assertEquals("Status", ManifestEntry.Status.ADDED, entry.status());
+  void checkEntry(ManifestEntry entry, ManifestEntry.Status status, Long expectedSequenceNumber) {

Review comment:
       This changes how `checkEntry` and `checkDataFile` are used. This updates `checkEntry` to now validate the v1 view of the data by inspecting the `ManifestEntry` and the `DataFile` it wraps. Similarly, `checkDataFile` validates the v2 view where all fields are part of `DataFile`.
   
   All of the test cases now use both.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r414914000



##########
File path: core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
##########
@@ -136,24 +141,31 @@ public void testV2ManifestRewriteWithInheritance() throws IOException {
     checkRewrittenManifest(manifest2, SEQUENCE_NUMBER, 0L);
 
     // should not inherit the v2 sequence number because it was written into the v2 manifest
-    checkRewrittenEntry(readManifest(manifest2), 0L);
+    checkEntry(readManifest(manifest2).asEntry(), ManifestEntry.Status.EXISTING, 0L);
+    checkDataFile(readManifest(manifest2), FileStatus.EXISTING, 0L);
   }
 
-  void checkEntry(ManifestEntry entry, Long expectedSequenceNumber) {
-    Assert.assertEquals("Status", ManifestEntry.Status.ADDED, entry.status());
+  void checkEntry(ManifestEntry entry, ManifestEntry.Status status, Long expectedSequenceNumber) {

Review comment:
       This changes how `checkEntry` and `checkDataFile` are used. This updates `checkEntry` to now validate the v1 view of the data by inspecting the `ManifestEntry` and the `DataFile` it wraps. Similarly, `checkDataFile` validates the v2 view where all fields are part of `DataFile`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#issuecomment-625371398


   I'm closing this because we plan to keep manifest_entry and data_file as separate structs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416120445



##########
File path: core/src/main/java/org/apache/iceberg/GenericDataFile.java
##########
@@ -265,52 +310,66 @@ public ByteBuffer keyMetadata() {
   }
 
   @Override
-  @SuppressWarnings("unchecked")
   public void put(int i, Object v) {
     int pos = i;
     // if the schema was projected, map the incoming ordinal to the expected one
     if (fromProjectionPos != null) {
       pos = fromProjectionPos[i];
     }
+
+    if (!(skipEntryFields && pos < 3)) {

Review comment:
       nit: I feel `!skipEntryFields || pos >= 3` would be easier to understand

##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)
+              .project(addRequiredColumns(fileProjection))
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())

Review comment:
       just out of curiosity: do we use the class loader from `GenericManifestFile` on purpose?

##########
File path: core/src/main/java/org/apache/iceberg/GenericDataFile.java
##########
@@ -377,12 +439,15 @@ public Object get(int i) {
 
   @Override
   public int size() {
-    return 13;
+    return 15;

Review comment:
       We added 3 more fields but removed block size? The Indexed wrappers will still handle the block correctly?

##########
File path: core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java
##########
@@ -129,7 +129,7 @@ protected long targetSplitSize(TableOperations ops) {
     public CloseableIterable<StructLike> rows() {
       return CloseableIterable.transform(
           ManifestFiles.read(manifest, io).project(fileSchema).allEntries(),
-          file -> (GenericManifestEntry) file);
+          GenericDataFile.AsManifestEntry.class::cast);

Review comment:
       This keeps the old schema because we are using `ManifestEntry.getSchema(partitionType)` in `AsManifestEntry`, right?

##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)

Review comment:
       External systems always write v2 manifests right now. This block ensures they will be read correctly even if the table metadata is still v1, correct? If somebody consumed the recent changes already, we could have v2 manifests of different format?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416159760



##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)

Review comment:
       It is very likely that we will also track this in the manifest list as well, but this is a good work-around to keep the changes separate. And it will help us detect if other metadata is ever wrong.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org