Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/07/30 07:26:00 UTC

[GitHub] [accumulo] ctubbsii opened a new issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

ctubbsii opened a new issue #1664:
URL: https://github.com/apache/accumulo/issues/1664


   @Manno15's PR #1650 employed a batching strategy of fixed-size batches. This is quite nice for consistent memory utilization and for ensuring the accumulo-gc can always make progress. However, the hard-coded 8MB batch size may not be optimal.
   
   Investigation is needed to determine what batch sizes might optimize throughput, and whether it makes sense to make the batch size configurable. Different batch sizes should be tested.
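
For readers unfamiliar with the idea, fixed-size batching can be sketched roughly as follows. This is only an illustration, not the code from PR #1650: the class name `CandidateBatcher`, the method `nextBatch`, and the 2-bytes-per-char size estimate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of size-bounded batching; not the accumulo-gc implementation.
final class CandidateBatcher {
  // Assumption: estimate memory as 2 bytes per char of the candidate string.
  static final int BYTES_PER_CHAR = 2;

  /**
   * Pulls candidates from the iterator until the estimated size reaches
   * limitBytes or the input is exhausted. With a positive limit, every call
   * takes at least one candidate if any remain, so the caller always makes
   * progress. A batch may overshoot the limit by at most one candidate.
   */
  static List<String> nextBatch(Iterator<String> candidates, long limitBytes) {
    List<String> batch = new ArrayList<>();
    long usedBytes = 0;
    while (candidates.hasNext() && usedBytes < limitBytes) {
      String candidate = candidates.next();
      batch.add(candidate);
      usedBytes += (long) BYTES_PER_CHAR * candidate.length();
    }
    return batch;
  }

  public static void main(String[] args) {
    // Tiny limit (10 bytes) so the batching is visible with three short strings.
    Iterator<String> candidates = Arrays.asList("aaaa", "bbbb", "cc").iterator();
    List<String> batch;
    while (!(batch = nextBatch(candidates, 10)).isEmpty()) {
      System.out.println(batch.size() + " candidate(s) in batch");
    }
  }
}
```

Because memory use is bounded by the limit rather than by the total number of candidates, the limit itself is the knob the investigation would vary.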


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman edited a comment on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
EdColeman edited a comment on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666759157


   I may be just adding more words and Christopher's comment has it covered, but in case this helps.
   
   It would be interesting to know some of the ratios when you run different timing scenarios, such as the number of candidates vs. the number of deletes.
   
   I think what we are trying to gauge is whether there is a bottleneck in the Accumulo gc process, and whether different batch sizes have any impact on where or when a bottleneck could be triggered. One thing that could help determine that: is there a timing difference between a large number of candidates with few deletes and a large number of candidates with lots of deletes? The first case may provide insight into the Accumulo overhead, while the second could be dominated by HDFS.
   The overall goal is to determine whether there are situations where Accumulo could get into a state where it just cannot keep up with deletes and would fall further and further behind. If that can be shown, then what are the triggering conditions, where are the bottlenecks, and does batch size have any impact?
   
   Some of this could be simulated, but there also need to be measurements that include HDFS interactions. If it can be shown that the Accumulo gc process is never a bottleneck regardless of batch size, then we know to focus on HDFS interactions for additional performance improvements. If there are times when the gc process is dominating the gc cycle, then what is the performance difference with different batch sizes?





[GitHub] [accumulo] ctubbsii closed issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii closed issue #1664:
URL: https://github.com/apache/accumulo/issues/1664


   





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666345272


   > I'll take a look at this today.
   
   Cool. Thanks!





[GitHub] [accumulo] milleruntime commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-667228222


   @Manno15 I was also curious about that.  I haven't used it yet but Keith did write up some instructions here: https://github.com/apache/accumulo-testing/blob/master/docs/gcs.md





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666761916


   Feel free to check in to the Slack channel if you have questions. I've been trying to be available there more often when I'm working, leaving the video call open for anybody to jump in if they want.





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694412494


   @Manno15 I'm curious how you were generating gc candidates in the metadata table. I would have thought you'd be able to easily get over 12MB of candidates, even on modest hardware, by simply adding fake `~del` entries to the metadata table directly. Being configurable, though, should allow users to discover what is optimal for them.





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-667230143


   Yeah, I looked through the docs and ran through the functions a bit. I might be able to create a performance test based on it.





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-667110855


   Have you two used the gcs testing suite in accumulo-testing? I am wondering if I can adapt it to test for this.  





[GitHub] [accumulo] Manno15 edited a comment on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 edited a comment on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694431085


   It's definitely possible that I could have done things more optimally. I was told a good way to produce a lot of delete candidates was to create a pre-split table (I chose 75k), ingest some data, compact the table, clone it, and then delete one of the two. From then on, I just had to compact the table to keep reproducing the delete candidates. This worked well, but it also took a decent amount of time between test runs.
   
   Another part of the issue is that a couple of the laptops in my cluster are very old and tend to crash even when they're idle.
   
   I haven't tried your method, so maybe the way I did it is more convoluted and taxing on the machines. I can look into doing that tomorrow to see if I can get better results.





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694359730


   @Manno15 I know you made the batch size configurable in #1706, which is great. But, if you have any results/conclusions from your investigation with various batch sizes that you tested, it'd be good to mention them here. I'm going to close the issue (so I don't forget to later), but you can still comment with your results/conclusions.





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666309367


   I'll take a look at this today. 








[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694492194


   Thanks for the info on your method. That way is clever, and should work, too. It is also more realistic as to how Accumulo actually operates (because there may or may not be candidates in use elsewhere in the metadata table, and because the files would actually exist). It's not a big deal. If we identify a better default, we can change the default value, but I wouldn't spend a lot of time on it if I were you. `8M` is already pretty reasonable, I would think, because assuming around 100 chars per candidate, that's still up to 40,000 candidates per batch.
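
The 40,000 figure can be reproduced with a little arithmetic. The comment does not say how many bytes each char is assumed to occupy; the sketch below assumes 2 bytes per char (as with Java's UTF-16 strings), which lands near 40,000, whereas 1 byte per char would give roughly double that. The class and method names are hypothetical.

```java
// Hypothetical helper for the back-of-the-envelope estimate; not part of Accumulo.
final class BatchCapacity {
  /** Number of whole candidates that fit in a batch of the given size. */
  static long candidatesPerBatch(long batchBytes, int charsPerCandidate, int bytesPerChar) {
    return batchBytes / ((long) charsPerCandidate * bytesPerChar);
  }

  public static void main(String[] args) {
    long batchBytes = 8L * 1024 * 1024; // the `8M` default discussed above

    // Assumption: 2 bytes per char.
    System.out.println(candidatesPerBatch(batchBytes, 100, 2)); // prints 41943

    // Alternative assumption: 1 byte per char.
    System.out.println(candidatesPerBatch(batchBytes, 100, 1)); // prints 83886
  }
}
```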





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666676241


   I'm curious if it matters whether the candidates still have references in the metadata table or not, for the overall throughput. For example, if a batch size of 8MB results in most candidates being still in use, and only a few are actually available for deletion, does the performance drop on the DFS call to actually delete the file? Would it be better to have a larger batch size in order to make that later phase more efficient after the in-use candidates have been removed?
   
   Perhaps you can provide some timing information for a few scenarios. If none of the scenarios seem to be substantially different in the throughput, it's probably not worth making a new property to make this configurable.





[GitHub] [accumulo] EdColeman commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-667168726


   No, I am not familiar with that suite.





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666562469


   My preliminary test results show that very little time can be saved by changing the batch size to a larger value. I plan on writing a more comprehensive test tomorrow to get a clearer conclusion.





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694402054


   From my testing, I can see a clear correlation between the number of times the batch size limit is reached and how long the collection cycle takes to complete, with the trial that never reached the limit being the quickest.
   
   The issues I faced were in getting the number of delete candidates, and more importantly the candidate length, large enough without my cluster of computers crashing or running out of memory. I was never able to run a trial in which a batch size limit greater than 12MB was reached.
   
   From my actual trials, with 75k delete candidates, a batch size of 12MB completed in 34 seconds on average, which was the quickest. With the same number of candidates, a batch size of 2MB completed in 58 seconds. That's a decent time difference for such a small difference in batch size. I would like to revisit this eventually and give it a full range of tests with batch sizes ranging from at least 12MB to 128MB.
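
To put the two reported timings on a common scale, they can be converted to throughput. The sketch below uses only the figures quoted above (75k candidates, 34 seconds, 58 seconds); the class and method names are hypothetical, and a single pair of averages is of course not a benchmark.

```java
// Hypothetical helper comparing the two reported trials; not a benchmark harness.
final class GcTrialThroughput {
  /** Candidates processed per second for one collection cycle. */
  static double candidatesPerSecond(long candidates, double seconds) {
    return candidates / seconds;
  }

  /** How many times longer the slower trial took than the faster one. */
  static double slowdown(double slowerSeconds, double fasterSeconds) {
    return slowerSeconds / fasterSeconds;
  }

  public static void main(String[] args) {
    long candidates = 75_000;
    System.out.println("12MB batch: " + candidatesPerSecond(candidates, 34) + " candidates/s");
    System.out.println(" 2MB batch: " + candidatesPerSecond(candidates, 58) + " candidates/s");
    System.out.println("slowdown:   " + slowdown(58, 34));
  }
}
```

That works out to roughly 2,200 versus 1,300 candidates per second, so the 2MB run took about 1.7 times as long as the 12MB run.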





[GitHub] [accumulo] Manno15 commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
Manno15 commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-694431085


   It's definitely possible that I could have done things more optimally. I was told a good way to produce a lot of delete candidates was to create a pre-split table (I chose 75k), compact the table, clone it, and then delete one of the two. From then on, I just had to compact the table to keep reproducing the delete candidates. This worked well, but it also took a decent amount of time between test runs.
   
   Another part of the issue is that a couple of the laptops in my cluster are very old and tend to crash even when they're idle.
   
   I haven't tried your method, so maybe the way I did it is more convoluted and taxing on the machines. I can look into doing that tomorrow to see if I can get better results.





[GitHub] [accumulo] ctubbsii commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-667413278


   > Yeah, I looked through the docs and ran through the functions a bit. I might be able to create a performance test based on it.
   
   That'd be cool.

