You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/27 21:10:06 UTC

[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #963: Update v2 manifests to store only DataFile (WIP)

aokolnychyi commented on a change in pull request #963:
URL: https://github.com/apache/incubator-iceberg/pull/963#discussion_r416120445



##########
File path: core/src/main/java/org/apache/iceberg/GenericDataFile.java
##########
@@ -265,52 +310,66 @@ public ByteBuffer keyMetadata() {
   }
 
   @Override
-  @SuppressWarnings("unchecked")
   public void put(int i, Object v) {
     int pos = i;
     // if the schema was projected, map the incoming ordinal to the expected one
     if (fromProjectionPos != null) {
       pos = fromProjectionPos[i];
     }
+
+    if (!(skipEntryFields && pos < 3)) {

Review comment:
       nit: I feel `!skipEntryFields || pos >= 3` would be easier to understand

##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)
+              .project(addRequiredColumns(fileProjection))
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())

Review comment:
       just out of curiosity: do we use the class loader from `GenericManifestFile` on purpose?

##########
File path: core/src/main/java/org/apache/iceberg/GenericDataFile.java
##########
@@ -377,12 +439,15 @@ public Object get(int i) {
 
   @Override
   public int size() {
-    return 13;
+    return 15;

Review comment:
       We added 3 more fields but removed block size? The Indexed wrappers will still handle the block correctly?

##########
File path: core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java
##########
@@ -129,7 +129,7 @@ protected long targetSplitSize(TableOperations ops) {
     public CloseableIterable<StructLike> rows() {
       return CloseableIterable.transform(
           ManifestFiles.read(manifest, io).project(fileSchema).allEntries(),
-          file -> (GenericManifestEntry) file);
+          GenericDataFile.AsManifestEntry.class::cast);

Review comment:
       This keeps the old schema because we are using `ManifestEntry.getSchema(partitionType)` in `AsManifestEntry`, right?

##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -214,16 +216,31 @@ private void cacheChanges() {
 
     switch (format) {
       case AVRO:
-        AvroIterable<ManifestEntry> reader = Avro.read(file)
-            .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
-            .rename("manifest_entry", GenericManifestEntry.class.getName())
-            .rename("partition", PartitionData.class.getName())
-            .rename("r102", PartitionData.class.getName())
-            .rename("data_file", GenericDataFile.class.getName())
-            .rename("r2", GenericDataFile.class.getName())
-            .classLoader(GenericManifestFile.class.getClassLoader())
-            .reuseContainers()
-            .build();
+        CloseableIterable<ManifestEntry> reader;
+        if (formatVersion < 2) {
+          reader = Avro.read(file)
+              .project(ManifestEntry.wrapFileSchema(fileProjection.asStruct()))
+              .rename("manifest_entry", GenericManifestEntry.class.getName())
+              .rename("partition", PartitionData.class.getName())
+              .rename("r102", PartitionData.class.getName())
+              .rename("data_file", GenericDataFile.class.getName())
+              .rename("r2", GenericDataFile.class.getName())
+              .classLoader(GenericManifestFile.class.getClassLoader())
+              .reuseContainers()
+              .build();
+        } else {
+          AvroIterable<GenericDataFile> files = Avro.read(file)

Review comment:
       External systems always write v2 manifests right now. This block ensures they will be read correctly even if the table metadata is still v1, correct? If somebody consumed the recent changes already, we could have v2 manifests of different format?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org