Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/03 05:21:48 UTC

[GitHub] [iceberg] ayushchauhan0811 opened a new pull request #2782: Support all operations in incremental scan

ayushchauhan0811 opened a new pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782


   Currently, incremental scan supports only the append operation (https://github.com/apache/iceberg/issues/2690). This PR adds support for other data operations too.
   
   It's a WIP; I will be updating the test cases too.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-894633511


   @RussellSpitzer I need some help with changing the test for delete files.
   
   In the current test, we list the files between the snapshots. So for a snapshot range that contains a delete operation, a file can be repeated, as it will be returned by the append operation as well as the delete operation. You can refer to this test case: https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestIncrementalDataTableScan.java#L175
   
   After my change it will return the `D` file twice, so should I annotate delete files like this: `-D`? I was looking for a way to annotate them but didn't find anything. How should I proceed with this?
   
   ```
   filesMatch(Lists.newArrayList("B", "C", "D", "D", "E", "I"), dataBetweenScan(1, 8));
   ```
   
   




[GitHub] [iceberg] szehon-ho commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-875312309


   I also wonder how it should look?  If there have only been deletes for example, what will the incremental read return?




[GitHub] [iceberg] RussellSpitzer commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-874948842


   @ayushchauhan0811 
   Rewrite and Merge operations will contain data that was already in the data set. Consider a compaction operation which changes no actual rows but combines files: all old files are no longer valid and are deleted, and a set of new files is added. So if you check which files were "added" by this compaction operation, you would see the entire table as having been "added" in this snapshot.
   
   Merge operations (copy-on-write) have a similar issue. Imagine a merge operation that updates a single row in a data file. The old file will be deleted and a new file will be created. The new file will have all the data in the old file and one additional row. If you scan this new file, you will get all the data which was appended in a previous action as well as the new data.
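
The compaction case above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: `CompactionDemo`, `DataFile`, `Snapshot`, and `rowsFromAddedFiles` are hypothetical names made up for illustration and are not Iceberg classes or methods. The operation labels are plain strings chosen to resemble the append and replace operations discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model for illustration; these are not Iceberg classes.
final class CompactionDemo {

  /** A data file, modeled as a name plus the ids of the rows it holds. */
  static final class DataFile {
    final String name;
    final List<Integer> rowIds;

    DataFile(String name, List<Integer> rowIds) {
      this.name = name;
      this.rowIds = rowIds;
    }
  }

  /** A snapshot, modeled as an operation label plus the files it added. */
  static final class Snapshot {
    final String operation;
    final List<DataFile> addedFiles;

    Snapshot(String operation, List<DataFile> addedFiles) {
      this.operation = operation;
      this.addedFiles = addedFiles;
    }
  }

  /** Naive incremental read: every row of every file added by the given snapshots. */
  static List<Integer> rowsFromAddedFiles(List<Snapshot> snapshots) {
    List<Integer> rows = new ArrayList<>();
    for (Snapshot snapshot : snapshots) {
      for (DataFile file : snapshot.addedFiles) {
        rows.addAll(file.rowIds);
      }
    }
    return rows;
  }

  /** Three snapshots: two appends, then a compaction that changes no rows. */
  static List<Snapshot> history() {
    Snapshot s1 = new Snapshot("append", List.of(new DataFile("A", List.of(1, 2))));
    Snapshot s2 = new Snapshot("append", List.of(new DataFile("B", List.of(3))));
    // Compaction: files A and B are deleted and replaced by a single file AB.
    Snapshot s3 = new Snapshot("replace", List.of(new DataFile("AB", List.of(1, 2, 3))));
    return List.of(s1, s2, s3);
  }

  public static void main(String[] args) {
    List<Snapshot> afterS1 = history().subList(1, 3); // snapshots s2 and s3
    // Only row 3 is new after s1, yet rows 1 and 2 come back and row 3 appears twice.
    System.out.println(rowsFromAddedFiles(afterS1)); // prints [3, 1, 2, 3]
  }
}
```

In this model the compaction snapshot contributes every row of the table even though no row changed, which is the duplication described above.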




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-889271554


   Yes, that makes sense; we can follow the same approach here too. I will update the PR and work on updating the tests as well.




[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-889264234


   This is close to the approach being worked on in the StructuredStreaming project, so rather than supporting all operations we just ignore the ones we can't reasonably handle. https://github.com/apache/iceberg/pull/2752 https://github.com/apache/iceberg/issues/2788
   
   I think it makes sense to allow a similar thing here.
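
One way to picture "ignore the ones we can't reasonably handle" is a filter over the snapshots in the requested range. The sketch below is an assumption-laden illustration, not the approach taken in this PR or in the linked streaming work: `OperationFilterDemo`, `Snapshot`, `snapshotsToScan`, and `includedOperations` are hypothetical names, and the operation labels are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the types and method names here are not Iceberg APIs.
final class OperationFilterDemo {

  /** A snapshot reduced to its id and its operation label. */
  static final class Snapshot {
    final long id;
    final String operation;

    Snapshot(long id, String operation) {
      this.id = id;
      this.operation = operation;
    }
  }

  /** Returns the ids of the snapshots whose operation the caller chose to include. */
  static List<Long> snapshotsToScan(List<Snapshot> range, Set<String> includedOperations) {
    List<Long> ids = new ArrayList<>();
    for (Snapshot snapshot : range) {
      if (includedOperations.contains(snapshot.operation)) {
        ids.add(snapshot.id);
      }
      // Any other snapshot (for example a compaction) is skipped rather than rejected.
    }
    return ids;
  }

  /** A sample range mixing appends with a compaction and a delete. */
  static List<Snapshot> sampleRange() {
    return List.of(
        new Snapshot(1, "append"),
        new Snapshot(2, "replace"),
        new Snapshot(3, "delete"),
        new Snapshot(4, "append"));
  }

  public static void main(String[] args) {
    System.out.println(snapshotsToScan(sampleRange(), Set.of("append"))); // prints [1, 4]
  }
}
```

Passing a wider set, such as `Set.of("append", "delete")`, would let a caller opt in to more operations while compactions stay excluded.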




[GitHub] [iceberg] RussellSpitzer commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-889264234


   This is close to the approach being worked on in the StructuredStreaming project, so rather than supporting all operations we just ignore the ones we can't reasonably handle. https://github.com/apache/iceberg/issues/2788
   
   I think it makes sense to allow a similar thing here.




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-877803297


   @RussellSpitzer As suspected, my changes give a temporary view of the table from one snapshot id to another, which isn't correct in the case of compaction/merge jobs.
   
   I think that, for an incremental scan between two snapshots, this works fine: it gives an incomplete view of the table. We can give users the option to configure which operations they want to include in the scan.
   
   <img width="937" alt="snapshot" src="https://user-images.githubusercontent.com/10010065/125197685-5bdd8b80-e27c-11eb-9329-aecfbb5997a6.png">
   
   <img width="1259" alt="count" src="https://user-images.githubusercontent.com/10010065/125197682-584a0480-e27c-11eb-9af0-54736b56e0f4.png">
   
   <img width="1440" alt="snapshot summary" src="https://user-images.githubusercontent.com/10010065/125197676-52542380-e27c-11eb-88b7-ccb68b1b48e3.png">
   




[GitHub] [iceberg] JustinLeesin commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
JustinLeesin commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-876238560


   > I also wonder how it should look? If there have only been deletes for example, what will the incremental read return?
   
   If there have been deletes, then the incremental read should return the deleted dataset with a tag D (indicating DELETE events), just like the MySQL binlog.
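
A binlog-style result of this kind could look like the sketch below. It is only an illustration under strong assumptions: rows are plain strings, the full row sets at both snapshots are available in memory, and `ChangeEventDemo` and `changeEvents` are hypothetical names rather than anything in Iceberg. The `I` tag for inserted rows is also an assumption; the comment above names only `D`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of tagged change events; not an Iceberg API.
final class ChangeEventDemo {

  /**
   * Diffs the row sets of two snapshots into tagged events:
   * "D" for rows present only at the start, "I" for rows present only at the end.
   */
  static List<String> changeEvents(Set<String> rowsAtStart, Set<String> rowsAtEnd) {
    List<String> events = new ArrayList<>();
    // TreeSet gives a deterministic (sorted) event order for the example.
    for (String row : new TreeSet<>(rowsAtStart)) {
      if (!rowsAtEnd.contains(row)) {
        events.add("D " + row);
      }
    }
    for (String row : new TreeSet<>(rowsAtEnd)) {
      if (!rowsAtStart.contains(row)) {
        events.add("I " + row);
      }
    }
    return events;
  }

  public static void main(String[] args) {
    // Only deletes happened between the two snapshots.
    System.out.println(changeEvents(Set.of("r1", "r2"), Set.of("r2"))); // prints [D r1]
    // One delete and one insert.
    System.out.println(changeEvents(Set.of("r1", "r2", "r3"), Set.of("r2", "r3", "r4"))); // prints [D r1, I r4]
  }
}
```

Under this model a range that contains only deletes yields only `D` events, and a compaction that changes no rows yields no events at all.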




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-889262287


   @RussellSpitzer What are your thoughts on this?  




[GitHub] [iceberg] RussellSpitzer commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-875008260


   I agree there should be a way to figure out the difference between two snapshots. Unfortunately, it's a bit more complicated for everything that isn't an append. :/




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-874963528


   I will test both scenarios out and post the findings here.
   
   But I think there should be a feature for incrementally scanning data between two snapshots. The current one only allows append operations.




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-874938974


   @RussellSpitzer Currently, incremental scan only supports append operations, but there can be cases where we want to include other operations too. We use a CDC job to write to an Iceberg table and then use this feature to do some computation on the incremental data between two snapshots.
   
   I didn't understand why it would cause issues with rewrite operations. Aren't rewrite operations like overwrite operations, where there are a bunch of delete files and newly added data files? I have tested it with overwrite operations on the CDC data.




[GitHub] [iceberg] RussellSpitzer commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-874840131


   I'm a little confused about the intention here. Won't this cause issues for Rewrite/Merge operations, which delete and then add back data that was already in the table?




[GitHub] [iceberg] ayushchauhan0811 commented on pull request #2782: Support all operations in incremental scan

Posted by GitBox <gi...@apache.org>.
ayushchauhan0811 commented on pull request #2782:
URL: https://github.com/apache/iceberg/pull/2782#issuecomment-876528796


   > If there have been deletes, then the incremental read should return the deleted dataset with a tag D (indicating DELETE events), just like the MySQL binlog.
   
   I don't agree with this. An incremental scan presents a temporary view of the table from one snapshot id to another snapshot id. The logic for handling deletes and presenting data should be the same for an incremental scan as it would be for the whole table.

