You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/07/22 10:23:14 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #5338: Spec: Inconsistency around files_count

Fokko opened a new pull request, #5338:
URL: https://github.com/apache/iceberg/pull/5338

   When building the Manifest mappers for Python, @rdblue  noticed that the `added_data_files_count` should be `added_files_count` according to the spec. 
   
   However, this field is written in Java as `added_data_files_count` to Avro by the Java implementation:  
   https://github.com/apache/iceberg/blob/8104769f81ba79338fd3c94d5bd9267f22d31ed7/api/src/main/java/org/apache/iceberg/ManifestFile.java#L44-L49
   
   Luckily this doesn't affect the reading/writing because it is position based.
   
   However, it is confusing. I think we should resolve this. We could either do this by changing the Java impl, which probably works, but we could also change the spec. I know that this isn't something lightweight, but we could take it into consideration.
   
   I think we should also update the references in the code to `{added,existing,deleted}_data_files_count` to make everything consistent and avoid confusion in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r934042618


##########
.palantir/revapi.yml:
##########
@@ -1,6 +1,17 @@
 versionOverrides:
   org.apache.iceberg:iceberg-api:apache-iceberg-0.14.0: "0.14.0"
 acceptedBreaks:
+  list:
+    org.apache.iceberg:iceberg-api:
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.ADDED_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.DELETED_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.EXISTING_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"

Review Comment:
   I like that a lot, thanks 👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r933876300


##########
.palantir/revapi.yml:
##########
@@ -1,6 +1,17 @@
 versionOverrides:
   org.apache.iceberg:iceberg-api:apache-iceberg-0.14.0: "0.14.0"
 acceptedBreaks:
+  list:
+    org.apache.iceberg:iceberg-api:
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.ADDED_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.DELETED_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"
+    - code: "java.field.removed"
+      old: "field org.apache.iceberg.ManifestFile.EXISTING_FILES_COUNT"
+      justification: "Removes inconsistency between code and spec"

Review Comment:
   Since we are going to be releasing 1.0 and can't deprecate these properly, can we leave the static field name intact and make it an alias for the correct one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r1025186972


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -52,11 +52,11 @@ public interface ManifestFile {
       optional(
           503, "added_snapshot_id", Types.LongType.get(), "Snapshot ID that added the manifest");
   Types.NestedField ADDED_FILES_COUNT =
-      optional(504, "added_data_files_count", Types.IntegerType.get(), "Added entry count");
+      optional(504, "added_files_count", Types.IntegerType.get(), "Added entry count");
   Types.NestedField EXISTING_FILES_COUNT =
-      optional(505, "existing_data_files_count", Types.IntegerType.get(), "Existing entry count");
+      optional(505, "existing_files_count", Types.IntegerType.get(), "Existing entry count");
   Types.NestedField DELETED_FILES_COUNT =
-      optional(506, "deleted_data_files_count", Types.IntegerType.get(), "Deleted entry count");
+      optional(506, "deleted_files_count", Types.IntegerType.get(), "Deleted entry count");

Review Comment:
   Hey @ajantha-bhat Thanks, that's actually a good point. However, If we go that route, I'd prefer to do that separately because of two reasons:
   - Looking at the code, I also see that we occasionally fetch the struct by name.
   - The code and spec are consistent, so we have to change them both



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spec: Inconsistency around files_count [iceberg]

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#issuecomment-1749550427

   Thanks for the review @rdblue 🙌 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r928714931


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -41,11 +41,11 @@ public interface ManifestFile {
       "Lowest sequence number in the manifest");
   Types.NestedField SNAPSHOT_ID = optional(503, "added_snapshot_id", Types.LongType.get(),
       "Snapshot ID that added the manifest");
-  Types.NestedField ADDED_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),

Review Comment:
   It was already mandatory for v2. But this was already the case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Spec: Inconsistency around files_count [iceberg]

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko merged PR #5338:
URL: https://github.com/apache/iceberg/pull/5338


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r927752961


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -41,11 +41,11 @@ public interface ManifestFile {
       "Lowest sequence number in the manifest");
   Types.NestedField SNAPSHOT_ID = optional(503, "added_snapshot_id", Types.LongType.get(),
       "Snapshot ID that added the manifest");
-  Types.NestedField ADDED_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),
+  Types.NestedField ADDED_DATA_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),

Review Comment:
   These changes look good to me. I think we can also update the field names. Those won't break anything because the only names that matter are the ones in the read schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r968292957


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -52,11 +52,11 @@ public interface ManifestFile {
       optional(
           503, "added_snapshot_id", Types.LongType.get(), "Snapshot ID that added the manifest");
   Types.NestedField ADDED_FILES_COUNT =
-      optional(504, "added_data_files_count", Types.IntegerType.get(), "Added entry count");
+      optional(504, "added_files_count", Types.IntegerType.get(), "Added entry count");
   Types.NestedField EXISTING_FILES_COUNT =
-      optional(505, "existing_data_files_count", Types.IntegerType.get(), "Existing entry count");
+      optional(505, "existing_files_count", Types.IntegerType.get(), "Existing entry count");
   Types.NestedField DELETED_FILES_COUNT =
-      optional(506, "deleted_data_files_count", Types.IntegerType.get(), "Deleted entry count");
+      optional(506, "deleted_files_count", Types.IntegerType.get(), "Deleted entry count");

Review Comment:
   I thought we change the spec document instead of changing the schema (field names) here.
   I saw the comments down below. Agree that this doesn't break compatibility. 
   
   But If we follow this path, `manifest_entry` has a field called `data_file` which represents both data and delete file (https://github.com/apache/iceberg/blame/master/format/spec.md#L421) May be we need to rename that aswell?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko closed pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
Fokko closed pull request #5338: Spec: Inconsistency around files_count
URL: https://github.com/apache/iceberg/pull/5338


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r927825917


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -41,11 +41,11 @@ public interface ManifestFile {
       "Lowest sequence number in the manifest");
   Types.NestedField SNAPSHOT_ID = optional(503, "added_snapshot_id", Types.LongType.get(),
       "Snapshot ID that added the manifest");
-  Types.NestedField ADDED_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),

Review Comment:
   Ugh, is this now used to track the number of added delete files in our delete manifests too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#issuecomment-1207495393

   @Fokko, sorry, but I think I was wrong in my initial review. I think instead of renaming these, we actually want to rename the fields that we write so that they match the spec. Because we use the same column for both `DataFile` and `DeleteFile` to count the number of `ADDED` rows, we should use a neutral name so `added_files_count` is correct.
   
   Looks like the problem here is that the original `added_data_files_count` is used instead of `added_files_count`, not that the variable names don't match.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r933876206


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -41,11 +41,11 @@ public interface ManifestFile {
       "Lowest sequence number in the manifest");
   Types.NestedField SNAPSHOT_ID = optional(503, "added_snapshot_id", Types.LongType.get(),
       "Snapshot ID that added the manifest");
-  Types.NestedField ADDED_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),

Review Comment:
   > Ugh, is this now used to track the number of added delete files in our delete manifests too?
   
   Yeah, we decided to use the existing metadata fields for the same purpose in delete manifests. I think it makes sense -- is there a problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #5338: Spec: Inconsistency around files_count

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #5338:
URL: https://github.com/apache/iceberg/pull/5338#discussion_r1025216240


##########
api/src/main/java/org/apache/iceberg/ManifestFile.java:
##########
@@ -52,11 +52,11 @@ public interface ManifestFile {
       optional(
           503, "added_snapshot_id", Types.LongType.get(), "Snapshot ID that added the manifest");
   Types.NestedField ADDED_FILES_COUNT =
-      optional(504, "added_data_files_count", Types.IntegerType.get(), "Added entry count");
+      optional(504, "added_files_count", Types.IntegerType.get(), "Added entry count");
   Types.NestedField EXISTING_FILES_COUNT =
-      optional(505, "existing_data_files_count", Types.IntegerType.get(), "Existing entry count");
+      optional(505, "existing_files_count", Types.IntegerType.get(), "Existing entry count");
   Types.NestedField DELETED_FILES_COUNT =
-      optional(506, "deleted_data_files_count", Types.IntegerType.get(), "Deleted entry count");
+      optional(506, "deleted_files_count", Types.IntegerType.get(), "Deleted entry count");

Review Comment:
   ok. Sounds reasonable to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org