Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/08/13 14:53:56 UTC

[GitHub] [accumulo] cshannon commented on issue #2820: Additional improvements to the du command,

cshannon commented on issue #2820:
URL: https://github.com/apache/accumulo/issues/2820#issuecomment-1214172137

   @EdColeman -
   
   Yesterday and today I spent a good amount of time diving into the Scan API and its implementation between the client and server to get a better feel for how that works, and I also started working on this a bit. I have a branch with a rough prototype/proof of concept that is a work in progress here: https://github.com/cshannon/accumulo/commits/accumulo-2820
   
   It's not ready for a real review yet, as there's more work to be done, but you can take a look if you get a chance and see the direction I'm going. I had a couple of questions/comments and wanted to get your thoughts.
   
   1. The metadata table scan could technically be done by the client without an RPC call, but I kept the current approach of sending an RPC request and letting the server do the work inside the [TableDiskUsage](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/server/base/src/main/java/org/apache/accumulo/server/util/TableDiskUsage.java) class. I think this is much better as it keeps the current design intact and is a simpler update; this utility already scans the metadata table for the file names it feeds to the HDFS iteration, so it can simply be updated to read the sizes from metadata instead, and the client/shell code can work more or less the same without many modifications. (There is a rough sketch of what a metadata-only scan could look like after this list.)
   2. I created a new disk usage RPC call that is the same as the old one but with a new method parameter. This will allow passing any options we want to customize the du command when it is sent to the server for processing. The main thing for now is a [Mode](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/core/src/main/thrift-gen-java/org/apache/accumulo/core/clientImpl/thrift/TDiskUsageMode.java) enum, which currently has FILE, DIRECTORY, and METADATA. The idea is that the user running the command could specify how they want the size computed, and the documentation will describe the benefits/drawbacks of each mode. FILE is the current default way of scanning the HDFS files, DIRECTORY would use the hdfs -dus command (not implemented yet in my prototype), and METADATA would just scan the metadata table. Having the options parameter and an enum for the mode will let us easily expand in the future with any flags or settings we want for computing usage. (See the second sketch after this list for how the mode could drive the computation.)
   3. I still need to update things to handle scanning the root table if someone wants to know the metadata table size itself.
   4. I haven't looked at the bulk import stuff yet, but that could be another mode or just be included automatically; I'm not sure.
   5. Tests, of course, will still need to be updated and written.
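   
   To make the METADATA mode idea in item 1 concrete, here is a minimal sketch of summing a table's file sizes straight from accumulo.metadata. This is not the code in my branch; the class name, the helper method, and the simplified value parsing are just for illustration, and it counts a file once per tablet that references it (the real TableDiskUsage de-duplicates shared files).
   
   ```java
   import java.util.Map.Entry;
   import org.apache.accumulo.core.client.AccumuloClient;
   import org.apache.accumulo.core.client.Scanner;
   import org.apache.accumulo.core.data.Key;
   import org.apache.accumulo.core.data.Range;
   import org.apache.accumulo.core.data.Value;
   import org.apache.accumulo.core.security.Authorizations;
   import org.apache.hadoop.io.Text;
   
   public class MetadataSizeSketch {
     // Sum the sizes recorded in the "file" column family of accumulo.metadata
     // for one table id, rather than stat'ing each RFile in HDFS.
     public static long estimateTableUsage(AccumuloClient client, String tableId) throws Exception {
       long total = 0;
       try (Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
         // Tablet rows for a table are "<tableId>;<endRow>" plus "<tableId><" for the
         // default tablet, so this row range covers every tablet of the table.
         scanner.setRange(new Range(new Text(tableId + ";"), new Text(tableId + "<")));
         scanner.fetchColumnFamily(new Text("file"));
         for (Entry<Key,Value> fileEntry : scanner) {
           // A file entry's value encodes "<size>,<numEntries>"; the size is what du needs.
           total += Long.parseLong(fileEntry.getValue().toString().split(",")[0]);
         }
       }
       return total;
     }
   }
   ```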
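   
   And for item 2, a rough illustration of how the mode could pick the computation on the server side. The enum values mirror FILE/DIRECTORY/METADATA from the prototype's TDiskUsageMode, but the dispatch method and the supplier parameters are made-up stand-ins, not how TableDiskUsage will actually be wired.
   
   ```java
   import java.util.function.LongSupplier;
   
   public class DiskUsageModeSketch {
   
     enum Mode {
       FILE,      // current behavior: walk the referenced RFiles in HDFS and sum their lengths
       DIRECTORY, // directory summary from HDFS (hdfs -dus style)
       METADATA   // sum the sizes already recorded in accumulo.metadata file entries
     }
   
     // Hypothetical dispatch point; the mode would arrive via the new RPC options.
     static long computeUsage(Mode mode, LongSupplier fileScan, LongSupplier dirSummary,
         LongSupplier metadataScan) {
       switch (mode) {
         case FILE:
           return fileScan.getAsLong();
         case DIRECTORY:
           return dirSummary.getAsLong();
         case METADATA:
           return metadataScan.getAsLong();
         default:
           throw new IllegalArgumentException("unknown mode: " + mode);
       }
     }
   }
   ```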

