Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/01 15:21:32 UTC

[GitHub] [incubator-iceberg] chenjunjiedada opened a new pull request #887: Define file and position based deletion file in spec

chenjunjiedada opened a new pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887
 
 
   This is for https://github.com/apache/incubator-iceberg/issues/359

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402470353
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,38 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema as following:
+```json
+{
+  "type": "struct",
+  "fields": [ {
+    "id": 1,
+    "name": "file_path",
+    "required": true,
+    "type": "string",
+    "doc": "The full URI of a data file, with FS scheme. This must match the file_path of the target data file in a manifest entry."
 
 Review comment:
   I don't think the requirements for these fields are obvious enough when they appear only in the docs. Let's use the `doc` field to describe what each field is, but the spec requirements should be pulled out and stated clearly.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401755240
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
 
 Review comment:
   This needs to clearly define what file and position are. I think that `file` should be renamed to `file_path` to match tracking in the manifest file and should use a similar description, along with a note that it must match what's in the manifest. We also need to note that positions start at 0.
   
   * `file_path` - The full URI of a data file, with FS scheme. This must match the `file_path` of the target data file in a manifest entry.
   * `position` - The ordinal position of a deleted row in the target data file identified by `file_path`, starting at 0.
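As a minimal sketch of the two definitions above (not Iceberg's implementation), the following Python shows how a reader could apply position deletes to one data file. The function name `apply_position_deletes` and the example paths are hypothetical; the only behavior taken from the comment is that `file_path` must match exactly and that positions start at 0.

```python
from typing import Iterable, List, Tuple


def apply_position_deletes(
    file_path: str,
    rows: Iterable[str],
    deletes: Iterable[Tuple[str, int]],
) -> List[str]:
    """Return the rows of one data file with its deleted positions removed.

    `deletes` holds (file_path, position) records. A record applies only
    when its file_path is exactly equal to the data file's path, which is
    why the path must match the manifest entry. Positions are 0-based.
    """
    deleted = {pos for path, pos in deletes if path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Hypothetical paths and rows, for illustration only.
rows = ["a", "b", "c", "d"]
deletes = [
    ("s3://bucket/data/file1.parquet", 0),
    ("s3://bucket/data/file1.parquet", 2),
    ("s3://bucket/data/file2.parquet", 1),  # different file, ignored
]
live = apply_position_deletes("s3://bucket/data/file1.parquet", rows, deletes)
```

Position 0 removes the first row, so an off-by-one interpretation would delete the wrong records; that is the reason to state the starting index in the spec.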



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402240874
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   Sure, no problem.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402420890
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
 
 Review comment:
   No problem.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402710469
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   Just updated. I also used field IDs 600 and 601.



[GitHub] [incubator-iceberg] leoluan2009 commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
leoluan2009 commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402719738
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,25 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.
 
 Review comment:
   Can you remove the comma between version and 1?



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402277896
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   You mean JSON format, right?



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402710469
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   Just updated.
   - Put `required` explicitly before the type to emphasize it.
   - Use field IDs starting from 600.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402424058
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so as to leverage the merge sort. The layout of sorted records in the deletion file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
 
 Review comment:
   Ok, let me update this.



[GitHub] [incubator-iceberg] rdblue commented on issue #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#issuecomment-608564700
 
 
   One of the changes I made was to reset the field IDs in the file/position delete struct. These are separate files, so we can reuse IDs. The reason we assign IDs in blocks for the other metadata files is that those are stored in the same manifest file.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402247728
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so as to leverage the merge sort. The layout of sorted records in the deletion file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
+ 
+It is also worth to note that in order to keep module independence, deletion files are written with the same file format as the table's file format.
 
 Review comment:
   I see, let me rephrase this.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401742458
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   Could you update this to be like the other schemas defined in the spec? It should be a table with a description of each field, Iceberg types, and should include field IDs.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402420765
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
 
 Review comment:
   I put the description into the `doc` property of the field.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401746298
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   Also, these fields are `required`.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402465845
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,38 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema as following:
+```json
+{
+  "type": "struct",
+  "fields": [ {
+    "id": 1,
+    "name": "file_path",
+    "required": true,
+    "type": "string",
+    "doc": "The full URI of a data file, with FS scheme. This must match the file_path of the target data file in a manifest entry."
+  }, {
+    "id": 2,
+    "name": "position",
+    "required": true,
+    "type": "long",
+    "doc": "The ordinal position of a deleted row in the target data file identified by file_path, starting at 0."
+  } ]
+}
+```
+
+The rows in the delete file must be sorted by `file_path` then `position` to optimize filtering rows while scanning. 
+
+*  Sorting by `file_path` allows filter pushdown by file in columnar storage formats.
+*  Sorting by `position` allows filtering rows while scanning, to avoid keeping deletes in memory.
+ 
+Though the delete files can be written using any supported data file format in Iceberg, it is recommended to write delete files with same file format as the table's file format to keep module independence.
 
 Review comment:
   Let's remove "to keep module independence" because it isn't clear what that means.
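The sort requirement in the quoted diff (rows ordered by `file_path`, then `position`) can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the reader Iceberg ships: `filter_deleted` is a hypothetical name, and the delete stream is assumed to be sorted in the same order Python uses to compare `(str, int)` tuples. Because both inputs are ordered, the reader only ever holds one delete record at a time.

```python
from typing import Iterable, Iterator, Optional, Tuple


def filter_deleted(
    file_path: str,
    rows: Iterable[str],
    sorted_deletes: Iterable[Tuple[str, int]],
) -> Iterator[str]:
    """Yield the live rows of one data file.

    `sorted_deletes` must be ordered by (file_path, position). The stream
    is consumed in step with the scan, so no set of deletes is built.
    """
    deletes = iter(sorted_deletes)
    current: Optional[Tuple[str, int]] = next(deletes, None)
    for pos, row in enumerate(rows):
        key = (file_path, pos)
        # Advance past delete records that sort before the current row.
        while current is not None and current < key:
            current = next(deletes, None)
        if current == key:
            continue  # this position is deleted
        yield row


# Delete records in the sorted layout used as the example in the diff.
deletes = [
    ("file1", 1), ("file1", 2), ("file1", 5),
    ("file2", 3), ("file2", 4), ("file2", 7),
]
live = list(filter_deleted("file2", list("abcdefgh"), deletes))
```

If the delete file were unsorted, the same filter would need every record for the target file loaded up front, which is the memory cost the second bullet in the diff is trying to avoid.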



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401762563
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
 
 Review comment:
   Why is this a sub-section of _Manifests_?
   
   Let's create a new section after _Table Metadata_ called _Delete Format_, with a sub-section for _Position-based delete files_. Then we should note that content in that section is in preparation for incrementing the format version:
   
   > This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401741386
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -232,6 +232,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file.                                         |
 | **`131  key_metadata`**           | `optional binary`                     | Implementation-specific key metadata for encryption                                                                                                                                                  |
 | **`132  split_offsets`**          | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.                                                                                     |
+| **`134  file_type`**              | `optional int`                        | Type of the data file. `0`: normal data file that has the same schema as the table's schema. `1`: file and position based deletion file.                                                         |
 
 Review comment:
   This change should not be in this PR. The purpose of this PR is to define the format of a file and position delete file, not to make additional changes to metadata. Could you remove this?



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402240539
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -232,6 +232,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | **`128  upper_bounds`**           | `optional map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all values in the column for the file.                                         |
 | **`131  key_metadata`**           | `optional binary`                     | Implementation-specific key metadata for encryption                                                                                                                                                  |
 | **`132  split_offsets`**          | `optional list`                       | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending.                                                                                     |
+| **`134  file_type`**              | `optional int`                        | Type of the data file. `0`: normal data file that has the same schema as the table's schema. `1`: file and position based deletion file.                                                         |
 
 Review comment:
   Sure, will move this to the file_type PR.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402465027
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
 
 Review comment:
   No, I meant table. Look at the other struct type definitions and copy those. The table format makes it easy to read.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402250384
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
 
 Review comment:
   OK, let me update this.



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402710508
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,38 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema as following:
+```json
+{
+  "type": "struct",
+  "fields": [ {
+    "id": 1,
+    "name": "file_path",
+    "required": true,
+    "type": "string",
+    "doc": "The full URI of a data file, with FS scheme. This must match the file_path of the target data file in a manifest entry."
+  }, {
+    "id": 2,
+    "name": "position",
+    "required": true,
+    "type": "long",
+    "doc": "The ordinal position of a deleted row in the target data file identified by file_path, starting at 0."
+  } ]
+}
+```
+
+The rows in the delete file must be sorted by `file_path` then `position` to optimize filtering rows while scanning. 
+
+*  Sorting by `file_path` allows filter pushdown by file in columnar storage formats.
+*  Sorting by `position` allows filtering rows while scanning, to avoid keeping deletes in memory.
+ 
+Though the delete files can be written using any supported data file format in Iceberg, it is recommended to write delete files with same file format as the table's file format to keep module independence.
 
 Review comment:
   OK



[GitHub] [incubator-iceberg] rdblue commented on issue #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#issuecomment-607360513
 
 
   Thanks, @chenjunjiedada! This looks like a good start. I've added a few comments where I'd like to make the spec more clear.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401746023
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so as to leverage the merge sort. The layout of sorted records in the deletion file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
+ 
+It is also worth to note that in order to keep module independence, deletion files are written with the same file format as the table's file format.
 
 Review comment:
   This is a recommendation, not a requirement, so we should say that explicitly. The requirement is that a delete file can be written using any supported data file format.
   
   Also, the purpose of the recommendation is not module independence. People choose file formats based on what they use for most tables and have experience tuning, so it makes sense to use the same format for delete files and delta files.
   
   It's convenient to not need to build a service with a dependency on iceberg-orc or iceberg-parquet if all data and delete files are Avro, but we don't want to have people misinterpret the spec and think that there is a guarantee that delete file formats match data files.



[GitHub] [incubator-iceberg] rdblue merged pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887
 
 
   



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r403165325
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,25 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema named "position_based_delete_file" with a struct of the following fields:
+
+| Field id, name          | Type                            | Description                                                                                                              |
+|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
+| **`600 file_path`**     | `required string`               | The full URI of a data file with FS scheme. This must match the file_path of the target data file in a manifest entry.   |
+| **`601 position`**      | `required long`                 | The ordinal position of a deleted row in the target data file identified by file_path, starting at 0.                    |
 
 Review comment:
   Here as well.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r403166671
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,25 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema named "position_based_delete_file" with a struct of the following fields:
 
 Review comment:
   Let's rephrase the second sentence so that it matches the ones above:
   
   > Position-based delete files store `file_position_delete`, a struct with the following fields:
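   
   As a rough illustration of that struct, here is a minimal Python sketch using the `file_path` and `position` fields from the table under review. The class name `FilePositionDelete` and the helper `is_sorted` are hypothetical and not part of Iceberg. The sketch also assumes Python's default string comparison is an acceptable stand-in for the ordering of `file_path`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True, order=True)
class FilePositionDelete:
    """One row of a position-based delete file (illustrative only)."""
    file_path: str  # field id 600: full URI of the target data file
    position: int   # field id 601: 0-based ordinal of the deleted row


def is_sorted(deletes: Iterable[FilePositionDelete]) -> bool:
    """Return True if rows are ordered by file_path, then position."""
    previous = None
    for current in deletes:
        # order=True compares fields in declaration order:
        # file_path first, then position.
        if previous is not None and current < previous:
            return False
        previous = current
    return True
```

   A writer could run a check like this before closing a delete file to confirm that the rows are ordered by `file_path` and then `position`.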
   



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402245850
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
 
 Review comment:
   Makes sense!



[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r402851653
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,25 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version, 1. This part of the spec is not yet complete and will be completed as format version 2.
 
 Review comment:
   Sure. I force pushed the commit.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401751813
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
 
 Review comment:
   Let's make this specific to position-based delete files instead of describing delete files generally. How about this:
   
   > Position-based delete files identify rows in one or more data files that have been deleted.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401748494
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
+
+Deletion files are files that indicate deletions of pre-existing rows to be applied to the dataset at read time. Deletion files may either specify rows by column value or by file name and row position.
+
+1. The file and position based deletion file has the schema as following:
+```
+{
+  filename string,
+  position long
+}
+```
+
+The rows in the deletion file must be sorted by `filename` and `position` so as to leverage the merge sort. The layout of sorted records in the deletion file looks like:
+```
+file1, 1
+file1, 2
+file1, 5
+file2, 3
+file2, 4
+file2, 7
+file3, 6
+file3, 8
+file3, 9
+```
 
 Review comment:
   I don't think this is necessary. It should be sufficient to say that the delete file must be sorted by ascending filename then position.
   
   This is for two reasons:
   1. Sorting by file allows filter pushdown by file in columnar storage formats.
   2. Sorting by position allows filtering rows while scanning, to avoid keeping deletes in memory.
   
   I think it would help to rephrase "so as to leverage the merge sort" to "allow filtering rows while scanning" or something similar. Although we think of this as merge-sort, the operation is not a sort that produces a sorted list of deletes and rows -- it's a filter.
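   
   To make the "filter, not sort" point concrete, here is a minimal Python sketch. It assumes the deleted positions for a single data file arrive in ascending order and that the data file's rows are read in order. The name `filter_deleted` and its parameters are hypothetical, not an Iceberg API.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def filter_deleted(rows: Iterable[T], deleted_positions: Iterable[int]) -> Iterator[T]:
    """Yield rows whose 0-based ordinal position is not deleted.

    Assumes deleted_positions is sorted ascending, so only the next
    pending delete position is held in memory at any time.
    """
    deletes = iter(deleted_positions)
    next_delete = next(deletes, None)
    for position, row in enumerate(rows):
        # Advance past delete positions already behind the scan
        # (this also tolerates duplicate positions).
        while next_delete is not None and next_delete < position:
            next_delete = next(deletes, None)
        if next_delete == position:
            continue  # this row is deleted; drop it
        yield row
```

   For example, `list(filter_deleted("abcdef", [1, 2, 5]))` keeps only the rows at positions 0, 3, and 4. The output is the input rows minus the deleted ones, in their original order; nothing is merged or re-sorted.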



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r401750250
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -254,6 +255,33 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains the file as “live” data is garbage collected. But this is harder to detect and requires finding the diff of multiple snapshots. It is easier to track what files are deleted in a snapshot and delete them when that snapshot expires.
 
+#### Deletion Files
 
 Review comment:
   This section should be "Position-based delete files" to distinguish these from equality delete files that will be added later.



[GitHub] [incubator-iceberg] rdblue commented on issue #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#issuecomment-608563983
 
 
   Thanks, @chenjunjiedada! I made a couple of minor changes and merged this.



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #887: Define file and position based deletion file in spec

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #887: Define file and position based deletion file in spec
URL: https://github.com/apache/incubator-iceberg/pull/887#discussion_r403165204
 
 

 ##########
 File path: site/docs/spec.md
 ##########
 @@ -333,6 +333,25 @@ Table metadata is stored as JSON. Each table metadata change creates a new table
 
 The atomic operation used to commit metadata depends on how tables are tracked and is not standardized by this spec. See the sections below for examples.
 
+### Delete Format
+
+This section details how to encode row-level deletes in Iceberg metadata. Row-level deletes are not supported in the current format version 1. This part of the spec is not yet complete and will be completed as format version 2.
+
+#### Position-based Delete Files
+
+Position-based delete files identify rows in one or more data files that have been deleted. It has the schema named "position_based_delete_file" with a struct of the following fields:
+
+| Field id, name          | Type                            | Description                                                                                                              |
+|-------------------------|---------------------------------|--------------------------------------------------------------------------------------------------------------------------|
+| **`600 file_path`**     | `required string`               | The full URI of a data file with FS scheme. This must match the file_path of the target data file in a manifest entry.   |
 
 Review comment:
   Can you add back-ticks to `file_path` to format it correctly?
