Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/12/16 17:42:58 UTC

[GitHub] [accumulo] milleruntime commented on pull request #1259: update du command to use hdfs iterator

milleruntime commented on pull request #1259:
URL: https://github.com/apache/accumulo/pull/1259#issuecomment-996038369


   I was talking to @EdColeman about this command and its various issues. It definitely has performance problems, but we are not sure exactly where in the code they occur. These changes may help, but it is also unclear whether the method that returns the `RemoteIterator` supports globbing. One way to sidestep the globbing question would be to rewrite the utility so that it does not use file globbing at all. I started looking deeper into the code and noticed that it is complicated because Accumulo can share files between tables when they are cloned. The command description states: "prints how much space, in bytes, is used by files referenced by a table. When multiple tables are specified it prints how much space, in bytes, is used by files shared between tables, if any."
   
   I ran a test where I cloned a table (ci) and deleted a few rows from the clone (ci2). This is what the command now prints:
   <pre>
   2021-12-16T12:14:36,461 [Shell.audit] INFO : root@uno> du ci  -h
      36.37G [ci]
   root@uno> du ci2 -h
   2021-12-16T12:14:41,771 [Shell.audit] INFO : root@uno> du ci2 -h
      36.37G [ci2]
   root@uno> du ci ci2 -h
   2021-12-16T12:14:44,676 [Shell.audit] INFO : root@uno> du ci ci2 -h
       1.59G [ci]
      34.78G [ci, ci2]
       1.59G [ci2]
   </pre>
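One possible reading of the three-line output, offered as an assumption and not as a confirmed description of the implementation: each file is attributed to the exact set of tables that reference it, and sizes are summed per set. That would be consistent with the numbers above, since 1.59G + 34.78G = 36.37G, the single-table total. A minimal Python sketch of that grouping follows; the function name, file names, and sizes are all hypothetical.

```python
from collections import defaultdict


def usage_by_table_set(file_refs, file_sizes):
    """Sum file sizes per distinct set of tables referencing each file.

    file_refs:  dict mapping file name -> set of table names that reference it
    file_sizes: dict mapping file name -> size in bytes
    (Both inputs are hypothetical stand-ins for what the metadata would give.)
    """
    usage = defaultdict(int)
    for path, tables in file_refs.items():
        # A frozenset is hashable, so the set of tables can be a dict key.
        usage[frozenset(tables)] += file_sizes[path]
    return dict(usage)


# Made-up data: one file shared by both tables, one exclusive file each.
file_sizes = {"A.rf": 100, "B.rf": 30, "C.rf": 30}
file_refs = {"A.rf": {"ci", "ci2"}, "B.rf": {"ci"}, "C.rf": {"ci2"}}

usage = usage_by_table_set(file_refs, file_sizes)
for tables in sorted(usage, key=sorted):
    print(f"{usage[tables]:>8} {sorted(tables)}")
```

Under this reading, the total for a single table is the sum over every group that contains it (here 30 + 100 = 130 for `ci`), which is what a one-table invocation would report.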
   I am not sure exactly what the first and third numbers mean or how useful they are, but I think the command could be greatly simplified if it only handled one table at a time. We could just look at the metadata for that table and report the size of the files it uses. We could also check whether a file lives in a tableID directory different from the table's own ID and report that to the user. The output could then look like this:
   <pre>
   2021-12-16T12:14:36,461 [Shell.audit] INFO : root@uno> du ci  -h
      36.37G [ci2, ci3]
   </pre>
   This tells the user that the table shares files with those other tables. We could also add an option to print the shared file paths.
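The single-table idea could be sketched roughly as below. This is Python for illustration only, not Accumulo code. It assumes file paths end in `.../tables/<tableId>/<tabletDir>/<file>` and that the table's file entries (path and size) have already been read from the metadata. The function names, table IDs, and sizes are hypothetical.

```python
def table_id_from_path(path):
    """Return the table ID directory of a file path.

    Assumes (hypothetically) that paths end in
    .../tables/<tableId>/<tabletDir>/<file>.
    """
    return path.split("/")[-3]


def single_table_usage(table_id, file_entries, id_to_name):
    """Total the sizes of one table's files and note foreign directories.

    file_entries: iterable of (path, size_in_bytes) for the table's files
    id_to_name:   dict mapping table ID -> table name
    Returns (total_bytes, sorted names of tables whose directories hold
    some of this table's files).
    """
    total = 0
    shared_with = set()
    for path, size in file_entries:
        total += size
        owner = table_id_from_path(path)
        if owner != table_id:
            # The file lives under another table's directory, so it is shared.
            shared_with.add(id_to_name.get(owner, owner))
    return total, sorted(shared_with)


# Made-up example: table "3" (ci2) still references a file under table "2" (ci).
names = {"2": "ci", "3": "ci2"}
entries = [
    ("hdfs://nn/accumulo/tables/2/t-0001/F0001.rf", 100),
    ("hdfs://nn/accumulo/tables/3/t-0001/F0002.rf", 50),
]
total, shared = single_table_usage("3", entries, names)
print(f"{total} [{', '.join(shared)}]")
```

A likely limitation of this sketch: it only notices files that live under another table's directory. Finding clones that reference files in this table's own directory would presumably still need a look at the other tables' metadata.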

