Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/07/31 18:11:32 UTC

[GitHub] [accumulo] ivakegg opened a new issue #1669: Automatic merging strategies

ivakegg opened a new issue #1669:
URL: https://github.com/apache/accumulo/issues/1669


   When significant data is deleted from consecutive tablets, they could be merged and still fall under the configured split threshold.  In systems where data is date-oriented and aged off, a significant number of empty or near-empty tablets can accumulate over time, resulting in tablet bloat.  We need to allow the system to merge tablets automatically.  A configurable merging strategy interface should be used so that systems can configure how they want this to happen over time.
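
As a rough illustration of the idea above, the following sketch groups consecutive tablets whose combined size stays under the split threshold. It is not Accumulo code: the `Tablet` class, the `plan_merges` function, and the byte sizes are hypothetical, and a real strategy would read tablet sizes from the metadata table and trigger the merges through the client API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tablet:
    # Hypothetical model of a tablet: its end row and its size on disk.
    end_row: Optional[str]  # None stands for the last tablet (no upper bound)
    size_bytes: int


def plan_merges(tablets: List[Tablet], split_threshold: int) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) groups of consecutive tablets whose
    combined size stays under split_threshold. Single tablets are skipped,
    because there is nothing to merge."""
    groups = []
    start = 0
    total = 0
    for i, tablet in enumerate(tablets):
        # Close the current group when adding this tablet would reach the threshold.
        if i > start and total + tablet.size_bytes >= split_threshold:
            if i - start > 1:
                groups.append((start, i - 1))
            start, total = i, 0
        total += tablet.size_bytes
    if len(tablets) - start > 1:
        groups.append((start, len(tablets) - 1))
    return groups


if __name__ == "__main__":
    sizes = [0, 10, 0, 900, 5, 5]
    tablets = [Tablet(end_row=str(i), size_bytes=s) for i, s in enumerate(sizes)]
    print(plan_merges(tablets, split_threshold=100))
```

The grouping is greedy and left to right, so it does not try to find the smallest possible number of tablets; it only guarantees that each merged group stays under the threshold.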


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-668769115


   So much complexity when a client-side solution would be simpler... and easier to coordinate with any client-side code that creates splits. I really don't think it's worth it. I'm not 100% against it... but much more against than in favor.


[GitHub] [accumulo] andrewglowacki edited a comment on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
andrewglowacki edited a comment on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-751519006


   I saw this after checking the status of  #1050. This is the exact situation I am interested in. Maybe this could be possible with a per-table manual configuration?
   
   Otherwise, the solution I'm looking at implementing is:
   - offline the table
   - manually merge the metadata for small tablets (the MERGING state of a merge)
   - online the table


[GitHub] [accumulo] keith-turner commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
keith-turner commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-668698749


   This was discussed on a JIRA issue that I cannot find.  What I remember from the JIRA issue is that we decided that automatic merges should never merge away user-added splits, only automatically added splits.  This avoided situations where a user creates a table, adds lots of splits, goes to lunch, the system merges away all their empty tablets, and then they come back from lunch and ingest lots of data into a single tablet.


[GitHub] [accumulo] ctubbsii commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-752705070


   > Is the underlying problem the performance of merging empty tablets on an active table? If so, I wonder if it would be possible to add an option to Merge a range of empty tablets and not have to lock the table.
   
   I don't think you can avoid locking the table in some way, but we may be able to add some sort of range lock.
   
   I was discussing the performance bottlenecks of merging with @EdColeman yesterday, and I pointed out that the biggest problem is chop-compactions, which truncate any non-empty tablets involved in the merge before completing the merge. This can be avoided in a special case if all sequential empty tablets being merged are merged into a single empty tablet, rather than merged with the adjacent non-empty one. This would avoid lots of HDFS operations and file IO in that special case. In the general case, this can be avoided by storing range constraints per-file to match the original tablet in which the file was specified, as described in #1327.
   
   Eliminating chop compactions would effectively make merges a metadata-only operation, with no file IO, which would eliminate a lot of the performance issues people have had with merging.
   
   
   As for this issue, I still think automatic merging strategies are best kept in user utility code outside of Accumulo's code base, even if it's just getting rid of empty tablets. It's hard to infer user intentions to do anything automatic, and it adds too much complexity to support the user specifying their intentions in some sort of pluggable mechanism, with no substantial value over a fully client-side utility.
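
The special case described in this comment, where runs of consecutive empty tablets are merged only with each other, could be selected along these lines. This is a minimal sketch with hypothetical names (`empty_runs`, `file_counts`); it only picks the index ranges and assumes the per-tablet file counts were already read from the metadata table. Whether such a merge actually skips chop compactions depends on the merge implementation.

```python
from typing import List, Tuple


def empty_runs(file_counts: List[int]) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of two or more
    consecutive tablets that have no files. Non-empty neighbors are
    never included, so no tablet with data is part of a merge range."""
    runs = []
    run_start = None
    for i, count in enumerate(file_counts):
        if count == 0:
            if run_start is None:
                run_start = i
        else:
            # A non-empty tablet ends the current run of empty tablets.
            if run_start is not None and i - run_start > 1:
                runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None and len(file_counts) - run_start > 1:
        runs.append((run_start, len(file_counts) - 1))
    return runs


if __name__ == "__main__":
    # One entry per tablet, in row order: number of files in that tablet.
    print(empty_runs([2, 0, 0, 0, 1, 0, 3, 0, 0]))
```

A single empty tablet between two non-empty ones is left alone here, since merging it would require involving a tablet that has data.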


[GitHub] [accumulo] andrewglowacki commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
andrewglowacki commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-751778532


   An automated merge could also use some configurable heuristics like:
   - Only consider tablets that have had no data written to them in 24 hours
   - Only consider tables whose number of tablets is more than a multiple of the number of tablet servers
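
The two heuristics above could be expressed as simple filters. The sketch below is illustrative only: `TabletInfo`, `last_write_epoch`, `table_is_candidate`, and `idle_tablets` are hypothetical names, and it assumes a per-tablet last-write time is available from somewhere, which is not something this thread establishes.

```python
from dataclasses import dataclass
from typing import List

DAY_SECONDS = 24 * 60 * 60


@dataclass
class TabletInfo:
    end_row: str
    last_write_epoch: float  # hypothetical: time of the last write, in epoch seconds


def table_is_candidate(num_tablets: int, num_tservers: int, multiple: int) -> bool:
    """Only consider tables whose tablet count exceeds multiple * tablet servers."""
    return num_tablets > multiple * num_tservers


def idle_tablets(tablets: List[TabletInfo], now: float,
                 idle_seconds: float = DAY_SECONDS) -> List[TabletInfo]:
    """Only consider tablets that have had no writes for at least idle_seconds."""
    return [t for t in tablets if now - t.last_write_epoch >= idle_seconds]


if __name__ == "__main__":
    now = 1_000_000.0
    tablets = [
        TabletInfo("a", now - 90_000),  # idle for 25 hours
        TabletInfo("b", now - 3_600),   # written one hour ago
    ]
    if table_is_candidate(num_tablets=100, num_tservers=10, multiple=5):
        print([t.end_row for t in idle_tablets(tablets, now)])
```

Both thresholds are parameters, so a per-table configuration could supply the idle period and the multiple.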


[GitHub] [accumulo] keith-turner commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
keith-turner commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-668754968


   > but I didn't see the lunch example, specifically
   
   Glad you found the issue, @ctubbsii.  I think a use case like the lunch example inspired some of the discussion.  Reading over the issue, one design I see emerging from the discussions is the following:
   
    * Add a mechanism to mark splits as eligible or ineligible for automatic merge.
    * Add a mechanism to inspect a split point's eligibility for automatic merge.
    * Mark any user created splits as ineligible for automatic merge by default.
    * Mark any system created splits as eligible for automatic merge by default.
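
The four points above could be modeled roughly as follows. This is a sketch of the design as listed, not an existing Accumulo API: `SplitOrigin`, `SplitPoint`, `override`, and `removable_splits` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SplitOrigin(Enum):
    USER = "user"      # split added explicitly by a user
    SYSTEM = "system"  # split created automatically by the system


@dataclass
class SplitPoint:
    row: str
    origin: SplitOrigin
    # Explicit eligibility mark; None means "fall back to the default for the origin".
    override: Optional[bool] = None

    def eligible_for_auto_merge(self) -> bool:
        """Inspect eligibility: an explicit mark wins; otherwise user splits
        are ineligible and system splits are eligible by default."""
        if self.override is not None:
            return self.override
        return self.origin is SplitOrigin.SYSTEM


def removable_splits(splits: List[SplitPoint]) -> List[str]:
    """Rows of the split points an automatic merge would be allowed to remove."""
    return [s.row for s in splits if s.eligible_for_auto_merge()]


if __name__ == "__main__":
    splits = [
        SplitPoint("g", SplitOrigin.USER),
        SplitPoint("m", SplitOrigin.SYSTEM),
        SplitPoint("t", SplitOrigin.USER, override=True),
        SplitPoint("x", SplitOrigin.SYSTEM, override=False),
    ]
    print(removable_splits(splits))
```

With these defaults, the pre-split table from the lunch example keeps all of its user-added splits unless someone marks them eligible.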


[GitHub] [accumulo] andrewglowacki edited a comment on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
andrewglowacki edited a comment on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-752630879


   @milleruntime there are two cases:
   - A situation as @ivakegg described originally that results in many tablets empty or very small compared to the split size.
   - Over time you need to increase the split size and now all of your tablets are much smaller than it.
   
   A utility to do this periodically would work; however, I'm not sure that alone would suffice, since the tablets need to be offline when the merge occurs. Ideally, Accumulo would offline only the tablets that need to be offline, but this would probably be very complicated.
   
   I can contribute the strategy I described above when I'm finished, if that would be helpful. It will be a standalone client-side tool.
   


[GitHub] [accumulo] ctubbsii commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-667412940


   I worry that automatic merging could let people shoot themselves in the foot, so to speak, with respect to performance. Any time Accumulo does something "automatically", it can be a surprise for users. For example, adding this could lead to situations where users unintentionally configure it to delete empty tablets, and then lose all the benefits of pre-splitting tables at creation time to prepare for efficient distributed bulk ingest.
   
   Baking this feature into Accumulo would also add significant complexity to Accumulo's own code, without much benefit over a client-side process. It seems to me that a client-side process could easily trigger merges as needed, without introducing this complexity internal to Accumulo, and it would perform just as well.
   
   In general, I'd prefer to reduce complexity in Accumulo, and modularize functionality, unless there is a clear benefit to being baked-in to justify the added complexity. In this case, I think a client-side process could perform this function just as well.
   
   What do you think? Do you think there's a clear benefit to having it baked in vs. having a client-side process to perform this function?


[GitHub] [accumulo] milleruntime commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-752482523


   Or maybe just a separate utility to remove empty tablets.  For example, you have a table with splits configured A-Z for the alphabet.  One time you got a lot of "Quilting" data so there are lots of Q tablets.  There won't be another Quilt convention so you just need to remove tablets "P-Q" from the table.


[GitHub] [accumulo] ctubbsii commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-668715033


   [This JIRA issue](https://issues.apache.org/jira/browse/ACCUMULO-1011) was closed as "Won't Fix" after being open for almost 7 years with nobody working on it. There were concerns there about automatically merging a user's splits, but I didn't see the lunch example, specifically.


[GitHub] [accumulo] milleruntime commented on issue #1669: Automatic merging strategies

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1669:
URL: https://github.com/apache/accumulo/issues/1669#issuecomment-752464842


   Is the underlying problem the performance of merging empty tablets on an active table?  If so, I wonder if it would be possible to add an option to merge a range of empty tablets and not have to lock the table.  This would essentially just remove the empty tablets from the metadata table.  We could check if the provided range of tablets is empty before locking and, if so, drop the stale references.  This would eliminate the complexity of trying to create something in Accumulo to automatically merge and would be easy for the client to perform.

