Posted to notifications@accumulo.apache.org by "cshannon (via GitHub)" <gi...@apache.org> on 2023/11/12 13:37:37 UTC

[I] Delete rows operation will fail to resume successfully if no tablets exist or time column is not read [accumulo]

cshannon opened a new issue, #3946:
URL: https://github.com/apache/accumulo/issues/3946

   While investigating #3938, I also tested deleting a range to see whether it is idempotent. After analysis and testing, it does appear to be idempotent for both 2.1 and the no-chop merge version, which is good. This makes sense because deletion works differently than merge: it doesn't copy files to a different tablet, it just fences files in existing tablets or deletes tablets by deleting all data in them. These operations are fine to repeat, because on retry it will either re-fence with the same files or skip over already deleted data.
   
   In one failure simulation I added some test code and threw an exception [here](https://github.com/apache/accumulo/blob/71f1b0fbc494342b5b659ed607ca67d5339848d1/server/manager/src/main/java/org/apache/accumulo/manager/TabletGroupWatcher.java#L781), after tablets are deleted but before the fate op updates the prev row column and finishes. I set the test to only throw the exception the first time through by using a flag so that on restart of the fate operation it would start over again and continue successfully to see if things were idempotent.
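   The one-shot failure injection described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Accumulo code: `OneShotFailureSketch`, `runOnce`, and `runWithRetry` are hypothetical names, and a sorted set of strings stands in for the tablets.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a repeatable delete step; not Accumulo code. */
final class OneShotFailureSketch {
  // Flag so the injected failure fires only on the first pass.
  private final AtomicBoolean failedOnce = new AtomicBoolean(false);
  private final Set<String> tablets = new TreeSet<>();
  private boolean finished = false;

  OneShotFailureSketch(Collection<String> initialTablets) {
    tablets.addAll(initialTablets);
  }

  /** One pass of the operation: delete, then (maybe) fail, then finish. */
  void runOnce(Set<String> toDelete) {
    // Repeatable step: removing already-removed tablets is a no-op.
    tablets.removeAll(toDelete);
    // Injected failure after the delete but before the final update.
    if (failedOnce.compareAndSet(false, true)) {
      throw new IllegalStateException("injected failure");
    }
    finished = true;
  }

  /** Restarts the operation from the beginning until it finishes. */
  int runWithRetry(Set<String> toDelete, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        runOnce(toDelete);
        return attempt;
      } catch (IllegalStateException e) {
        // Simulates the restart: loop around and start over.
      }
    }
    throw new IllegalStateException("operation did not finish");
  }

  List<String> remaining() {
    return new ArrayList<>(tablets);
  }

  boolean isFinished() {
    return finished;
  }
}
```

   In this sketch the first pass deletes and then throws, and the second pass repeats the delete harmlessly and finishes, so `runWithRetry` returns 2.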
   
   When testing various deletion ranges, I ran into the following 2 bugs that can occur when the deletion range is infinite (null end row) and we are deleting everything at the end of the table. The bugs apply in 2.1 as well and are not unique to no-chop merge.
   
   1. The first issue is that we can get an NPE in [this](https://github.com/apache/accumulo/blob/71f1b0fbc494342b5b659ed607ca67d5339848d1/server/manager/src/main/java/org/apache/accumulo/manager/TabletGroupWatcher.java#L798) section on resume. This can occur in some instances when the deletion range goes to the end of the table and we restart (but tablets still exist). Because this is a retry, the first time through had already scanned and removed everything, so the `metadataTime` column never gets read and we get an NPE when calling `metadataTime.getType()`. I think we could fix this by simply looking up the time type if it's null, using something like: 
   ```
    Optional.ofNullable(metadataTime).map(MetadataTime::getType)
                   .orElse(client.tableOperations().getTimeType(manager.getContext().getTableName(extent.tableId())));
   ```
   2. If the delete range includes everything (all rows deleted), then all tablets get removed. If the failure then occurs after deleting all tablets (as I simulated) but before creating the new default tablet, then on restart the fate operation will never resume. This is because no tablets exist, so `TabletGroupWatcher` has nothing to process to resume the fate op and it just hangs. In this scenario I'm not quite sure of the best way to handle it. It's definitely going to be an edge case, but it seems like we'd need to essentially re-create a new default tablet on recovery so the fate op could resume and finish rather than being stuck forever with no tablets.
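   For the first issue, the null-safe fallback can be tried out in isolation with a sketch like the one below. It uses stand-in types rather than the real Accumulo classes: `TimeTypeFallbackSketch`, `MetadataTimeStub`, the nested `TimeType` enum, and `resolve` are all hypothetical. It also uses `orElseGet` with a `Supplier`, so the fallback lookup only runs when the value read during the scan is actually null, which is a variation on the snippet above rather than the same code.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Illustrative only; the nested types are stand-ins, not Accumulo classes. */
final class TimeTypeFallbackSketch {
  /** Stand-in for a table's time type. */
  enum TimeType { LOGICAL, MILLIS }

  /** Stand-in for the value read from the time column during the scan. */
  static final class MetadataTimeStub {
    private final TimeType type;

    MetadataTimeStub(TimeType type) {
      this.type = type;
    }

    TimeType getType() {
      return type;
    }
  }

  /**
   * Returns the type from metadataTime when it was read, otherwise asks the
   * supplied lookup. The lookup is invoked only when metadataTime is null.
   */
  static TimeType resolve(MetadataTimeStub metadataTime, Supplier<TimeType> lookup) {
    return Optional.ofNullable(metadataTime)
        .map(MetadataTimeStub::getType)
        .orElseGet(lookup);
  }
}
```

   A call such as `TimeTypeFallbackSketch.resolve(null, () -> TimeTypeFallbackSketch.TimeType.MILLIS)` takes the fallback path. In real code the lookup would be whatever call retrieves the table's time type, and it would need its own error handling.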
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Delete rows operation will fail to resume on retry if no tablets exist or time column is not read [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon closed issue #3946: Delete rows operation will fail to resume on retry if no tablets exist or time column is not read
URL: https://github.com/apache/accumulo/issues/3946




Re: [I] Delete rows operation will fail to resume on retry if no tablets exist or time column is not read [accumulo]

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on issue #3946:
URL: https://github.com/apache/accumulo/issues/3946#issuecomment-1810230955

   > My hope is maybe it would be not too hard to just detect there's only 1 tablet left as we delete the tablets and if it's the last one then to just drop only the files from that tablet.
   
   I think you should be able to easily add a check to just not delete the default tablet (the last tablet in a table; the one with the end row `tableId<`), but instead just fence/drop its files and update its prev end row (`~tab:~pr`).
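   A rough sketch of that check, using a simplified in-memory model rather than the real metadata table, might look like the following. Everything here is hypothetical: `KeepDefaultTabletSketch`, `Tablet`, and `deleteToEndOfTable` are illustration names, a null `endRow` marks the default tablet, and the sketch assumes the start of the delete range lines up with an existing tablet boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Simplified in-memory model; not the Accumulo metadata schema. */
final class KeepDefaultTabletSketch {
  static final class Tablet {
    String prevEndRow;      // null means the start of the table
    final String endRow;    // null marks the default (last) tablet
    final List<String> files = new ArrayList<>();

    Tablet(String prevEndRow, String endRow, String... files) {
      this.prevEndRow = prevEndRow;
      this.endRow = endRow;
      this.files.addAll(Arrays.asList(files));
    }
  }

  /**
   * Deletes everything after startRow (exclusive) through the end of the
   * table. A null startRow means the whole table. Assumes startRow matches
   * an existing tablet boundary. The default tablet is never removed; its
   * files are dropped and its prev end row is updated instead.
   */
  static void deleteToEndOfTable(List<Tablet> tablets, String startRow) {
    Iterator<Tablet> it = tablets.iterator();
    while (it.hasNext()) {
      Tablet t = it.next();
      boolean inRange = startRow == null
          || (t.prevEndRow != null && t.prevEndRow.compareTo(startRow) >= 0);
      if (!inRange) {
        continue;
      }
      if (t.endRow == null) {
        // Default tablet: keep it so there is always something to resume on.
        t.files.clear();
        t.prevEndRow = startRow;
      } else {
        it.remove();
      }
    }
  }
}
```

   Because the default tablet always survives, running `deleteToEndOfTable` a second time with the same `startRow` leaves the model unchanged, which is the property a retry after a failure would rely on.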




Re: [I] Delete rows operation will fail to resume on retry if no tablets exist or time column is not read [accumulo]

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on issue #3946:
URL: https://github.com/apache/accumulo/issues/3946#issuecomment-1808380956

   For option 1, TableOperations.getTimeType() reads the time type from the first tablet in the metadata table for the table.  If the table has no tablets then it will throw an exception.
   
   One way to solve option 1 and 2 may be to never delete all of the tablets, always leave at least one tablet.  Not sure how this would look in the code though.  Wonder if it would be easiest to refactor the 2.1 and main code to follow the pattern in elasticity where delete rows only drops files or fences/chops and then calls the merge code.  That pattern never deletes all of the tablets.  Not sure if that is a good pattern to follow for 2.1 and main though.  It depends on how much complexity adding a special case to always keep at least one tablet introduces to the current delete rows code.




Re: [I] Delete rows operation will fail to resume on retry if no tablets exist or time column is not read [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #3946:
URL: https://github.com/apache/accumulo/issues/3946#issuecomment-1809297208

   Thanks @keith-turner, I will take a look and see which of those options is best. I can experiment with a special case to prevent all tablets from being deleted and compare the implementation vs trying to refactor everything to keep empty tablets and merge them away. My hope is maybe it would be not too hard to just detect there's only 1 tablet left as we delete the tablets and if it's the last one then to just drop only the files from that tablet.




Re: [I] Delete rows operation will fail to resume on retry if no tablets exist or time column is not read [accumulo]

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #3946:
URL: https://github.com/apache/accumulo/issues/3946#issuecomment-1807132063

   @keith-turner - Any thoughts on this, in particular solving problem number 2 when no tablets exist so the retry doesn't happen?

