You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/09 12:41:55 UTC

[GitHub] [iceberg] openinx opened a new issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

openinx opened a new issue #2308:
URL: https://github.com/apache/iceberg/issues/2308


   Since we've introduced row-level delete in format v2, so we will encounter the problem that RewriteFiles and RowDelta commit the transaction at the same time.
   
   Assume that we have an iceberg table `test`, and has the following events: 
   
   ```python
   INSERT   <1, 'AAA'>
   INSERT   <2, 'BBB'>
   DELETE   <1, 'AAA'>
   ```
   
   At the timestamp `t1`,  someone start a rewrite action to rewrite the whole table. 
   
   At the timestamp `t2`,  someone start another transaction to update the rows in table `test`: 
   
   ```python
   DELETE <2, 'BBB'>
   ```
   
   At the timestamp `t3`,  the update txn (which started from `t2`) commit the txn successfully. 
   
   At the timestamp `t1`,  the rewrite action commit the txn successfully.
   
   Finally, the table will have one row `<2, 'BBB'>`, while in fact we should have no rows.  That's an unexpected bug after introducing format v2,  and we will need solution to handle it.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-801576874


   @stevenzwu  I will publish the PR about seqNum changes today, it's a very simple change. 
   
   @yyanyy 
   > I think it's a reasonable approach to mark the rewritten files with the old seqNum, but I'm not sure if we necessarily need to use an old sequence ....
   
   Yeah,  it will write the manifest/data/delete files with the old seqNum, but the newly created snapshot will use an increased seqNum because we have the validation that seqNum MUST be increasing for snapshots.
   
   > since it's possible that there could be separate concurrent jobs that do rewrites from equality to positional deletes?
   
   For the rewrite action that converting equality to position deletes,  we could still use the same approach, say commit the rewrite txn with old seqNum. Because those newly produced pos-deletes applies to the files that does not change.  The position-delete issue that we discussed before is:   
   
   Seq1: (RowDelta 1)
   INSERT, <1, A>
   INSERT, <2, B>
   DELETE, <1, A>
   
   Seq2: (RowDelta 2)
   POS-DELETE, <2, B> <---- here is a position deletion
   
   Seq3: (Rewrite)
   INSERT, <2,B>
   
   The committed pos-delete could not mask the newly generated files that committed in later rewrite action. This case we have to retry the whole rewrite process. 
   
   > using sequence number of the files to determine if a certain index file is current, or outdated. 
   
   I am sorry that I was absent from the meeting I promised to attend. I misunderstood it as 1pm Beijing time (The correct time is 12am~1am Beijing actually).   Is possible to combine the snapshot seqNum with file's seqNum to solve this problem ?  Actually,  I did not fully understand this background.
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max edited a comment on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max edited a comment on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856394394


   >  Luckily, we streaming update/upsert job won't produce any position delete files that delete the old data files, so that should be OK.
   
   @openinx  I think this is possible. When you insert multiple data with the same primary key very quickly and perform upsert, a lot of position delete files will be generated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
yyanyy commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-800782826


   My understanding of what @stevenzwu referred to is that, if the operation fails due new RowDelta commits coming in, the rewrite operation could potentially  takes whatever changes introduced in the new RowDelta commits and make them part of the rewrite commit along with the old job output originally produced, so that we don't have to go through the expensive recompute while ensure data correctness? e.g. in your case from the email, it might be something like:
   
   Seq1:  (RowDelta 1)
   INSERT,  <1, A> 
   INSERT,  <2, B>
   DELETE, <1, A>
   
   Seq2: (RowDelta 2)
   DELETE, <2, B>
   
   Seq3: (Rewrite)
   INSERT, <2,B> 
   DELETE, <2, B>   <--- directly taking from RowDelta 2
   
   although because equality deletes only apply to sequence numbers that are less than itself, we may need to update sequence number of the files introduced from RowDelta commits to ensure they are larger than the rewrite operation's sequence number, which means we will increment sequence numbers for more than 1 in a commit, which might not be ideal.
   
   I think the approach sounds good to me, with some caveat such as we shouldn't blindly assume no conflicts when we do such rewrite operation, as positional deletes with sequence number greater than the sequence number being used for rewritten will break as you mentioned. A similar hypothetical case is that if the rewrite only works for a given partition, and there is a positional delete that affects multiple partitions with a larger sequence number, or if the rewrite only works against files with sequence number up to N and there's an equality write with sequence number N + 1 then rewrite on the files that are pointed by this positional delete will break. 
   
   In the meanwhile I think this change does break the current assumption that the file is always written with the same sequence number as the commit/snapshot itself, and it might be possible that people are using file and manifest's sequence number to find out when they are introduced to the table, but now we are breaking this behavior. But I guess it might be fine since there are other more reliable sources like metadata tables that could help with the same information. I wonder if other people have thoughts on assumptions like this that could be break by this change. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] moon-fall commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
moon-fall commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-841749911


   I encountering this data error problem when i start a flink job to write data and another to RewriteFiles,I also think the rewritten files should use the old seqNum,I don't find PR for this issue .Has the plan been decided on? I think it's necessary to solve this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856400636


   >  But it does not work for the position delete files because once we rewrite the data files then all of the row offset are changed, finally the delete files could not be applied to the rewrite files. 
   
   Isn’t this position delete files mixed with the equality delete files to filter data files? You mean that position delete files can filter data files directly, instead of first filtering equality delete files through position delete files, and then filtering through equality delete files after filtering data files?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-801280334


   > If we plan to maintain sequence number like that, then the semantic of sequence number will be: if there are some real update/insert/delete in the table data set then we should increase the sequence number, otherwise we should keep the same sequence number for the rewrite action. 
   
   @openinx can you elaborate on the seqNum behavior in your original example. My understanding is that rewrite action commit have to re-increment the sequence number after the successful delta commit.
   
   > I'm trying to commit the rewrite action with the stale seqNum that the table wrote before start the rewrite action. 
   
   Is this correct with even with the new proposed seq number behavior?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-844316952


   @openinx, I'm just back from parental leave so I'm catching up on this now. Thanks for your patience here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-799394224


   I don't think I see the data semantic issue, since the rewrite operation should fail as @stevenzwu has said, because it violates the constraint imposed by the delete.
   
   I can understand how this is a practical problem if the rewrite can never complete because the row level delete files are being added too quickly, but I don't see how this can cause different data.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856394394


   
   >  Luckily, we streaming update/upsert job won't produce any position delete files that delete the old data files, so that should be OK.
   
   @stevenzwu I think this is possible. When you insert multiple data with the same primary key very quickly and perform upsert, a lot of position delete files will be generated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-800847413


   > We may need to update sequence number of the files introduced from RowDelta commits to ensure they are larger than the rewrite operation's sequence number. 
   
   I think we are thinking the similar apporach to fix the semantic issue caused by commit order. You are trying to increase the seqNum of delete files that were written in the previous RowDelta when doing the rewrite action commit, I'm trying to commit the rewrite action with the stale seqNum that the table wrote before start the rewrite action. The key point is: Let the RowDelta's equality deletions could be applied to the following Rewriten data files. 
   
   The second appoach seems to be more simplier because we don't have to find all the delete files that was introduced by the previous RowDelta txn. we could just use the current sequence number as the newly introduced rewrite manifest's sequence number.
   
   As I comment before, the equality write path don't introduce any position delete files that will be applied to the old files. So we don't have to fallback to re-do the whole rewrite action if conflicts happen. But for the batch row-level delete job, we will have to fallback to re-do the rewrite action because we have to re-evaluate the row set that are not deleted by the previous position deletion.
   
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-799180514


   > Is this a typo: t1 -> t4?
   
   Yeah,  thanks @stevenzwu.  It's a typo, sorry for that.
   
   > It have to re-apply all the expensive rewrite computation based on the latest snapshot.
   
   Re-do the expensive rewrite process is not the solution that people want because if the streaming job is checkpointing for every 1min and the rewrite action take 4min to complete , then there is probability that the expensive rewrite action will be always retrying. 
   
   Besides the expensive resource cost from retrying,  there is a data semantic issue that will confuse people much.  I mean different txn commit order between RewriteFiles and  CDC/Upsert transaction will lead to different data set for the same dataset & operations.  Then people won't trust the truth of iceberg table. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856393667


    Luckily, we streaming update/upsert job won't produce any position delete files that delete the old data files, so that should be OK.
   
   
   I think this is possible. When you insert multiple data with the same primary key very quickly and perform upsert, a lot of position delete files will be generated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu edited a comment on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
stevenzwu edited a comment on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-799470435


   If we can avoid the complete re-computation during retry, it might solve the liveliness problem of stuck in conflict-retry loop. Can the rewrite action reuse most/part of computation result from the previous try so that it can shorten the duration in the retry? Only do additional incremental computation on the delta change since last computation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-799470435


   If we can avoid the complete re-computation during retry, it might solve the practical problem of stuck in conflict-retry loop. Can the rewrite action reuse most/part of computation result from the previous try so that we can shorten the duration in the retry. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-844349495


   I just read through this so I think I'm caught up. Sounds like we are discussing two problems, one short term and one long term.
   
   The short term issue is that we have a correctness problem when a row delete is committed while a rewrite is happening. I think we should fix this problem right away using the approach that @stevenzwu suggested: rewrite should validate that there are no new delete files for the partitions that are rewritten. If there are new delete files, then the rewrite should fail. This will prevent the rewrite from succeeding right now, but I think it is important to fix correctness issues as soon as possible.
   
   The long-term issue is how to modify the rewrite operation so that we can avoid the conflict with new delete files. The approach discussed above is what we originally intended to use when we were designing the v2 metadata: a rewrite should track data files with an _older_ sequence number so that new delete files are applied correctly. That works for equality deletes, but not for position deletes. Since we're primarily concerned about streaming use cases, I think that is okay. We can continue to fail the rewrite if new position deletes are added.
   
   Also, we can't update the sequence number of the new delete files. We have to change the sequence number of the rewritten data files. That's because there may be other data files that the new delete files would apply to if the delete file's sequence number was increased. For example, if the delete were an upsert instead -- a delete and an insert for the same row ID -- then moving the delete forward would cause Iceberg to drop the inserted replacement row.
   
   @yyanyy brings up a good point about how this limits the use of sequence numbers in other cases. I think that's okay. It is better to allow the table to be compacted. We'll figure out that problem as we design the secondary metadata. I think it would be okay because the new files would just not be covered by the secondary index and would need to be scanned. So it wouldn't introduce a correctness issue, just some potential staleness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
yyanyy commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-801330831


   I think there are two seqNum concepts here: seqNum for the table/commit and seqNum for the file. I think it's a reasonable approach to mark the rewritten files with the old seqNum, but I'm not sure if we necessarily need to use an old sequence number for the commit since they are stored as part of the snapshot, and suddenly have an old seqNum within the snapshot could be confusing even if there's no other implication.
   
   > the equality write path don't introduce any position delete files that will be applied to the old files. So we don't have to fallback to re-do the whole rewrite action if conflicts happen
   
   Do we want to skip the check on commit conflicts entirely? I think we always want to do the check but just to not fail the commit unless there's positional deletes even for streaming systems, since it's possible that there could be separate concurrent jobs that do rewrites from equality to positional deletes? 
   
   Another thing that's worth noting is that there's a meeting about designing secondary index earlier today with the community, and when we discussed about rolling up secondary indexes to partition/global level (anything higher than file level index file), Anton @aokolnychyi brought up a point about using sequence number of the files to determine if a certain index file is current, or outdated. If we implement the change mentioned in this issue's proposal, we couldn't achieve that, since the data file path will be different after rewritten but the sequence number stays the same, and the system that builds index won't know that it has to update the index file which eventually will cause a discrepancy.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-805625362


   @rdblue  What's your opinion about this issue ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-800021752


   @RussellSpitzer  If we just fail the rewrite operation as @stevenzwu said,  Yes it should be no data semantic issue.  The failure  could be acceptable for batch jobs, I mean we could just fail the rewrite operation if someone write few new update/deletes into the base iceberg table, the batch update/deletes job is a rare operation, the probability of conflict between rewrite operation and update operation is small.   
   
   But in streaming case, that is a different story.  Because the streaming job is always running and continues to commit the delta changes to iceberg table periodlly ( such as 1 minutes), the probability of conflict between rewrite operation and update operation is very large.  So failing the rewrite operation and re-do the whole expensive job is unacceptable for this case.  I'm trying to propose the solution that we don't have to re-do the rewrite operation,  the simplest way is following the current opportunistic concurrency control,  I mean the rewrite operation could just grab the table lock again and re-commit the generated files to iceberg table with the original sequence number from base snapshot, then the delete files will always effect the rewritten data files.  But it does not work for the position delete files because once we rewrite the data files then all of the row offset are changed, finally the delete files could not  be applied to the rewrite files.  Luckily,  we st
 reaming update/upsert job won't produce any position delete files that delete the old data files,  so that should be OK.
   
   ( btw,  I provided an unit test to demonstrate why the data semantics will be an issue if just try to reuse the files from rewrite operation here https://github.com/apache/iceberg/pull/2303/commits/27e47a95d2761fa36d7724f8eac21b75b91f280c#diff-e573276c8dbbcd32f174f0e778334aeea106c67a19efabfefedf0afa4779a499R155-R218,  the [line](https://github.com/apache/iceberg/pull/2303/commits/27e47a95d2761fa36d7724f8eac21b75b91f280c#diff-e573276c8dbbcd32f174f0e778334aeea106c67a19efabfefedf0afa4779a499R217) will be broken because the rewrite action makes the `<2, B>` visible again.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-799099514


   > At the timestamp t1, the rewrite action commit the txn successfully.
   
   Is this a typo: t1 -> t4?
   
   In this case shouldn't the rewrite action commit fail due to conflict in opportunistic concurrency control?
   
   If the rewrite action took longer than the streaming write commit interval, there is a chance that rewrite action can stuck in the conflict-and-retry loop. It have to re-apply all the expensive rewrite computation based on the latest snapshot.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
openinx commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-800851402


   If we plan to maintain sequence number like that, then the semantic of sequence number will be:  if there are some real update/insert/delete in the table data set then we should increase the sequence number, otherwise we should keep the same sequence number for the rewrite action.  The previous semantic is:  if there is any txn that committed to iceberg table, then the sequence number should be increased.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] moon-fall commented on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
moon-fall commented on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-841749911


   I encountering this data error problem when i start a flink job to write data and another to RewriteFiles,I also think the rewritten files should use the old seqNum,I don't find PR for this issue .Has the plan been decided on? I think it's necessary to solve this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #2308:
URL: https://github.com/apache/iceberg/issues/2308


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max removed a comment on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max removed a comment on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856393667


    Luckily, we streaming update/upsert job won't produce any position delete files that delete the old data files, so that should be OK.
   
   
   I think this is possible. When you insert multiple data with the same primary key very quickly and perform upsert, a lot of position delete files will be generated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xwmr-max edited a comment on issue #2308: Handle the case that RewriteFiles and RowDelta commit the transaction at the same time

Posted by GitBox <gi...@apache.org>.
xwmr-max edited a comment on issue #2308:
URL: https://github.com/apache/iceberg/issues/2308#issuecomment-856400636


   >  But it does not work for the position delete files because once we rewrite the data files then all of the row offset are changed, finally the delete files could not be applied to the rewrite files. 
   
   @openinx Isn’t this position delete files mixed with the equality delete files to filter data files? You mean that position delete files can filter data files directly, instead of first filtering equality delete files through position delete files, and then filtering through equality delete files after filtering data files?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org