You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Benjamin Mahler <be...@gmail.com> on 2015/03/04 03:36:21 UTC

Re: Review Request 30014: Enable the slave garbage collector to take the number of links into account.

Hey Ritwik, please hold off dealing with review comments. The current
approach here still seems to be not quite what we want.

In the current patch, you're speeding up garbage collection "globally" when
an individual directory has too many child directories. This approach works
fine with "disk usage" as it is a global constraint we must deal with.
However, the "link count" is a per-directory constraint. This means we want
to be able to purge on a per-directory basis rather than globally.

I apologize that I haven't had time to provide specific implementation
guidance and it's looking like this isn't a "newbie" ticket after all.
Bernd is pitching some time in here to help. In the meantime you might want
to pick up some other tickets that will require less shepherding from us!

On Sat, Feb 28, 2015 at 4:54 AM, Bernd Mathiske <be...@mesosphere.io> wrote:

>    This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30014/
>
> First pass.
>
>
>    src/slave/constants.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880941#file880941line64> (Diff
> revision 5)
>
> extern const Duration REGISTRATION_BACKOFF_FACTOR;
>
>    64
>
> extern const double HARD_LINKS_THRESHOLD;
>
>   The other constants have a comment. Why not this one?
>
>
>    src/slave/flags.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line158> (Diff
> revision 5)
>
> public:
>
>    158
>
>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>
>   Danger: is "-" left-associative here or right-associative? Better not to make any assumptions and put extra parentheses.
>
> It seems to me that the first expression is meant to be left-associative and the second right-associative. See more discussion why this is so below.
>
> Where is max_framework_link_usage defined? This comes out of context here.
>
>
>    src/slave/flags.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line170> (Diff
> revision 5)
>
> public:
>
>    170
>
>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>
>   Same as above.
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line237> (Diff
> revision 5)
>
> public:
>
>   236
>
>   // Garbage collects the directories based on the current disk usage.
>
> 237
>
>   // Garbage collects the directories based on the current disk usage and number
>
>   It is probably worth mentioning here that you mean the maximum current number of executor directories  in any active framework.
>
> While you are changing this sentence here, could you maybe also improve on the phrasing "the directories"? Using the definite article "the" without referring to anything in context makes it unclear to the reader what directories are meant.
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line242> (Diff
> revision 5)
>
> public:
>
>   240
>
>   void _checkDiskUsage(const process::Future<double>& usage);
>
> 242
>
>   void _checkGCParameters(const process::Future<GCParameters>& gc_params);
>
>
>    1.
>
>    We don't use snake case, we use camel case for variable naming. Please also correct all other variables in this review.
>    2.
>
>    No need to abbreviate "parameters" to "params".
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line639> (Diff
> revision 5)
>
> 639
>
>   // Getters
>
>   Redundant comment.
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line640> (Diff
> revision 5)
>
> 640
>
>   double getDiskUsage() const;
>
>   Why use getters and setters here at all? What is the benefit if all the setters are public?
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line643> (Diff
> revision 5)
>
> 643
>
>   // Setters
>
>   Redundant comment.
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line648> (Diff
> revision 5)
>
> 648
>
>   // 'disk_usage' denotes a fraction of the disk that is currently being used
>
>   "a fraction"? What other fractions are there? Do you mean "the fraction"?
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line649> (Diff
> revision 5)
>
> 649
>
>   // by the slave process. This usage is defined as a fraction of a limit
>
>   "This usage is defined as a fraction of a limit specified by a flag."
>
> This is too nebulous. Suggestion: reference/point to a place with the actual definition.
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line652> (Diff
> revision 5)
>
> 652
>
>   // expressed as a fraction of the limit on the maximum number of
>
>   "the maximum number of subdirectories"
>
> Please add " in any given parent directory" to be extra clear. Otherwise one could possibly read this mistakenly as:
>
> "the maximum number of subdirectories anywhere in a disk volume".
>
>
>    src/slave/slave.hpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line656> (Diff
> revision 5)
>
> 656
>
>   Try<double> disk_usage, max_link_usage;
>
>
>    1.
>
>    Please declare these two members separately, with separate comments.
>    2.
>
>    "max_link_usage"? This sounds like a limit, not like a current gauge which it is. Suggestion: "topLinkUsage".
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3596> (Diff
> revision 5)
>
> void Slave::registerExecutorTimeout(
>
>    3596
>
>       0.0, std::min((1.0 - flags.gc_disk_headroom - gc_params.getDiskUsage()),
>
>   The indentation in the next line makes this hard to read. I'd say line this up this way:
>
> return flags.gc_delay * std::max(
>       0.0, std::min((1.0 - ...),
>                     (1.0 - ...;
>
> Even better: use variables with good descriptive naming for intermediate expression results. Or break out the formluae into small functions, even if they do not occur in multiple places.
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3597> (Diff
> revision 5)
>
> void Slave::registerExecutorTimeout(
>
>    3597
>
>           (1.0 - flags.gc_disk_headroom - gc_params.getMaxLinkUsage())));
>
>
>    1. This can't be right. Subtracting link usage from disk head room?
>
> Such a glitch is exactly why I would like to suggest writing a unit test later on. Otherwise we may have this feature implemented, but it may not be working at all.
>
>    1. Left or right-associative use of "-" intended here? Suggestion: use extra parentheses! It looks like left-associative is correct here, but it won't be once you have replaced gc_disk_headroom with hard_links_threshold (hardLinksThreshold)... Then you have
>
> (1.0 - 0.8) - maxLinkUsage // seems wrong
>
> Proposal: Make these definitions more parallel in meaning. Use a headroom concept for hard links as well or change disk headroom to a disk usage threshold. I'd opt for the former, since it would touch less code.
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3637> (Diff
> revision 5)
>
> 3637
>
>   GCParameters params;
>
>   params -> parameters
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3638> (Diff
> revision 5)
>
> 3638
>
>   params.setDiskUsage(fs::usage(flags.work_dir));
>
>   insert blank line above and below this line
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3640> (Diff
> revision 5)
>
> 3640
>
>     std::string executors_parent_path = paths::getExecutorsParentPath(
>
>   std::string -> string
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3642> (Diff
> revision 5)
>
> 3642
>
>     struct stat s;
>
>   insert blank line above this one and swap the two declarations, because lstat uses s, not limit.
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3645> (Diff
> revision 5)
>
> 3645
>
>       params.setMaxLinkUsage(ErrnoError(
>
>   Hm. Is it really the best strategy to report nothing at all just because we are unsure about one single framweork?
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3650> (Diff
> revision 5)
>
> 3650
>
>       limit = ::pathconf(executors_parent_path.c_str(), _PC_LINK_MAX);
>
>   Why not declare "limit" here? Why declare it in an outer scope and needlessly initialize it there?
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3651> (Diff
> revision 5)
>
> 3651
>
>       if (limit < 0 && errno != 0) {
>
>   What about "limit < 0" and "errno == 0"? This is a possible response from pathconf. Then you keep going, dividing by zero in line 3656.
>
> Did you mean this?
>
> "limit < 0 || errno != 0"
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3656> (Diff
> revision 5)
>
> 3656
>
>       double link_usage_fraction = 1.0 * s.st_nlink / limit;
>
>   What if we are on a system where the limit gets ignored and mkdir still works and GC for some reason has not caught up yet? Do we get usages above 1.0 then? Is this expected and handled well in therest of the code?
>
>
>    src/slave/slave.cpp
> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3658> (Diff
> revision 5)
>
> 3658
>
>           std::max(params.getMaxLinkUsage(), link_usage_fraction));
>
>   Won't this crash if we have set maxLinkUsage to Error in line 3645?
>
>
> .
>
> - Bernd Mathiske
>
> On February 27th, 2015, 2:48 p.m. PST, Ritwik Yadav wrote:
>   Review request for mesos, Ben Mahler, Dominic Hamon, and Vinod Kone.
> By Ritwik Yadav.
>
> *Updated Feb. 27, 2015, 2:48 p.m.*
>  *Bugs: * MESOS-391 <https://issues.apache.org/jira/browse/MESOS-391>
>  *Repository: * mesos
> Description
>
> A fix to enable the slave garbage collector to take the number of hard links (st_nlinks) into account when creating new directories.
>
>   Testing
>
> Mesos builds sucessfully and all existing tests pass.
>
>   Diffs
>
>    - src/slave/constants.hpp (fd1c1aba0aa62372ab399bee5709ce81b8e92cec)
>    - src/slave/constants.cpp (2a99b1105af0e2d62b956fa96988f477937a46bd)
>    - src/slave/flags.hpp (56b25caf3901b38bdecb50310e8bcae0b114efa8)
>    - src/slave/paths.hpp (1618439d728ded347ec75317ce8dd998acd7ee94)
>    - src/slave/paths.cpp (01ea856aa2e628d4aee5fd31f7e49d147f740e8f)
>    - src/slave/slave.hpp (7b58cade9ada5f0a56b4c60c502eb907ccae33b4)
>    - src/slave/slave.cpp (9f31fa46304398e8f87b41b55d8f4cfd4aba10b9)
>    - src/tests/gc_tests.cpp (deaa6b1b6c32ae6d153229248d7d4f57caa0ebcf)
>
> View Diff <https://reviews.apache.org/r/30014/diff/>
>

Re: Review Request 30014: Enable the slave garbage collector to take the number of links into account.

Posted by Benjamin Mahler <be...@gmail.com>.
It's not clear if we still want to ensure GC events are ordered globally.
We might want to break that order if we have per-item constraints to deal
with (e.g. link count).

Can you elaborate on the delay approach you're mentioning?

On Wed, Mar 4, 2015 at 9:30 AM, Ritwik <ri...@gmail.com> wrote:

> Hi Ben,
>
> I had already addressed all the comments. I was trying to figure out a way
> to reproduce the issue with a failing test before posting a new diff. I am
> sorry for the delay since it took me some time to figure out concepts like
> containerization etc. which were used in the test corresponding to disk
> usage.
>
> I understand the point you are trying to make. However, this contradicts a
> statement which you made earlier. If we want the number of directories to
> affect the garbage collection "locally" then the only way is to re-order
> the sequence of directories which were scheduled for GC by making their
> delay a function of the number of executor directories. I will post a
> detailed solution (if I am able to figure out one) on the ticket before
> delving into the code next time.
>
> In the meanwhile, I will take up simpler issues to solve.
>
> Thanks for all the help and your time.
>
> Best Regards,
>
> On 4 March 2015 at 08:06, Benjamin Mahler <be...@gmail.com>
> wrote:
>
>> Hey Ritwik, please hold off dealing with review comments. The current
>> approach here still seems to be not quite what we want.
>>
>> In the current patch, you're speeding up garbage collection "globally"
>> when an individual directory has too many child directories. This approach
>> works fine with "disk usage" as it is a global constraint we must deal
>> with. However, the "link count" is a per-directory constraint. This means
>> we want to be able to purge on a per-directory basis rather than globally.
>>
>> I apologize that I haven't had time to provide specific implementation
>> guidance and it's looking like this isn't a "newbie" ticket after all.
>> Bernd is pitching some time in here to help. In the meantime you might want
>> to pick up some other tickets that will require less shepherding from us!
>>
>> On Sat, Feb 28, 2015 at 4:54 AM, Bernd Mathiske <be...@mesosphere.io>
>> wrote:
>>
>>>    This is an automatically generated e-mail. To reply, visit:
>>> https://reviews.apache.org/r/30014/
>>>
>>> First pass.
>>>
>>>
>>>    src/slave/constants.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880941#file880941line64> (Diff
>>> revision 5)
>>>
>>> extern const Duration REGISTRATION_BACKOFF_FACTOR;
>>>
>>>    64
>>>
>>> extern const double HARD_LINKS_THRESHOLD;
>>>
>>>   The other constants have a comment. Why not this one?
>>>
>>>
>>>    src/slave/flags.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line158> (Diff
>>> revision 5)
>>>
>>> public:
>>>
>>>    158
>>>
>>>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>>>
>>>   Danger: is "-" left-associative here or right-associative? Better not to make any assumptions and put extra parentheses.
>>>
>>> It seems to me that the first expression is meant to be left-associative and the second right-associative. See more discussion why this is so below.
>>>
>>> Where is max_framework_link_usage defined? This comes out of context here.
>>>
>>>
>>>    src/slave/flags.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line170> (Diff
>>> revision 5)
>>>
>>> public:
>>>
>>>    170
>>>
>>>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>>>
>>>   Same as above.
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line237> (Diff
>>> revision 5)
>>>
>>> public:
>>>
>>>   236
>>>
>>>   // Garbage collects the directories based on the current disk usage.
>>>
>>> 237
>>>
>>>   // Garbage collects the directories based on the current disk usage and number
>>>
>>>   It is probably worth mentioning here that you mean the maximum current number of executor directories  in any active framework.
>>>
>>> While you are changing this sentence here, could you maybe also improve on the phrasing "the directories"? Using the definite article "the" without referring to anything in context makes it unclear to the reader what directories are meant.
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line242> (Diff
>>> revision 5)
>>>
>>> public:
>>>
>>>   240
>>>
>>>   void _checkDiskUsage(const process::Future<double>& usage);
>>>
>>> 242
>>>
>>>   void _checkGCParameters(const process::Future<GCParameters>& gc_params);
>>>
>>>
>>>    1.
>>>
>>>    We don't use snake case, we use camel case for variable naming. Please also correct all other variables in this review.
>>>    2.
>>>
>>>    No need to abbreviate "parameters" to "params".
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line639> (Diff
>>> revision 5)
>>>
>>> 639
>>>
>>>   // Getters
>>>
>>>   Redundant comment.
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line640> (Diff
>>> revision 5)
>>>
>>> 640
>>>
>>>   double getDiskUsage() const;
>>>
>>>   Why use getters and setters here at all? What is the benefit if all the setters are public?
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line643> (Diff
>>> revision 5)
>>>
>>> 643
>>>
>>>   // Setters
>>>
>>>   Redundant comment.
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line648> (Diff
>>> revision 5)
>>>
>>> 648
>>>
>>>   // 'disk_usage' denotes a fraction of the disk that is currently being used
>>>
>>>   "a fraction"? What other fractions are there? Do you mean "the fraction"?
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line649> (Diff
>>> revision 5)
>>>
>>> 649
>>>
>>>   // by the slave process. This usage is defined as a fraction of a limit
>>>
>>>   "This usage is defined as a fraction of a limit specified by a flag."
>>>
>>> This is too nebulous. Suggestion: reference/point to a place with the actual definition.
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line652> (Diff
>>> revision 5)
>>>
>>> 652
>>>
>>>   // expressed as a fraction of the limit on the maximum number of
>>>
>>>   "the maximum number of subdirectories"
>>>
>>> Please add " in any given parent directory" to be extra clear. Otherwise one could possibly read this mistakenly as:
>>>
>>> "the maximum number of subdirectories anywhere in a disk volume".
>>>
>>>
>>>    src/slave/slave.hpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line656> (Diff
>>> revision 5)
>>>
>>> 656
>>>
>>>   Try<double> disk_usage, max_link_usage;
>>>
>>>
>>>    1.
>>>
>>>    Please declare these two members separately, with separate comments.
>>>    2.
>>>
>>>    "max_link_usage"? This sounds like a limit, not like a current gauge which it is. Suggestion: "topLinkUsage".
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3596> (Diff
>>> revision 5)
>>>
>>> void Slave::registerExecutorTimeout(
>>>
>>>    3596
>>>
>>>       0.0, std::min((1.0 - flags.gc_disk_headroom - gc_params.getDiskUsage()),
>>>
>>>   The indentation in the next line makes this hard to read. I'd say line this up this way:
>>>
>>> return flags.gc_delay * std::max(
>>>       0.0, std::min((1.0 - ...),
>>>                     (1.0 - ...;
>>>
>>> Even better: use variables with good descriptive naming for intermediate expression results. Or break out the formluae into small functions, even if they do not occur in multiple places.
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3597> (Diff
>>> revision 5)
>>>
>>> void Slave::registerExecutorTimeout(
>>>
>>>    3597
>>>
>>>           (1.0 - flags.gc_disk_headroom - gc_params.getMaxLinkUsage())));
>>>
>>>
>>>    1. This can't be right. Subtracting link usage from disk head room?
>>>
>>> Such a glitch is exactly why I would like to suggest writing a unit test later on. Otherwise we may have this feature implemented, but it may not be working at all.
>>>
>>>    1. Left or right-associative use of "-" intended here? Suggestion: use extra parentheses! It looks like left-associative is correct here, but it won't be once you have replaced gc_disk_headroom with hard_links_threshold (hardLinksThreshold)... Then you have
>>>
>>> (1.0 - 0.8) - maxLinkUsage // seems wrong
>>>
>>> Proposal: Make these definitions more parallel in meaning. Use a headroom concept for hard links as well or change disk headroom to a disk usage threshold. I'd opt for the former, since it would touch less code.
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3637> (Diff
>>> revision 5)
>>>
>>> 3637
>>>
>>>   GCParameters params;
>>>
>>>   params -> parameters
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3638> (Diff
>>> revision 5)
>>>
>>> 3638
>>>
>>>   params.setDiskUsage(fs::usage(flags.work_dir));
>>>
>>>   insert blank line above and below this line
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3640> (Diff
>>> revision 5)
>>>
>>> 3640
>>>
>>>     std::string executors_parent_path = paths::getExecutorsParentPath(
>>>
>>>   std::string -> string
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3642> (Diff
>>> revision 5)
>>>
>>> 3642
>>>
>>>     struct stat s;
>>>
>>>   insert blank line above this one and swap the two declarations, because lstat uses s, not limit.
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3645> (Diff
>>> revision 5)
>>>
>>> 3645
>>>
>>>       params.setMaxLinkUsage(ErrnoError(
>>>
>>>   Hm. Is it really the best strategy to report nothing at all just because we are unsure about one single framweork?
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3650> (Diff
>>> revision 5)
>>>
>>> 3650
>>>
>>>       limit = ::pathconf(executors_parent_path.c_str(), _PC_LINK_MAX);
>>>
>>>   Why not declare "limit" here? Why declare it in an outer scope and needlessly initialize it there?
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3651> (Diff
>>> revision 5)
>>>
>>> 3651
>>>
>>>       if (limit < 0 && errno != 0) {
>>>
>>>   What about "limit < 0" and "errno == 0"? This is a possible response from pathconf. Then you keep going, dividing by zero in line 3656.
>>>
>>> Did you mean this?
>>>
>>> "limit < 0 || errno != 0"
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3656> (Diff
>>> revision 5)
>>>
>>> 3656
>>>
>>>       double link_usage_fraction = 1.0 * s.st_nlink / limit;
>>>
>>>   What if we are on a system where the limit gets ignored and mkdir still works and GC for some reason has not caught up yet? Do we get usages above 1.0 then? Is this expected and handled well in therest of the code?
>>>
>>>
>>>    src/slave/slave.cpp
>>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3658> (Diff
>>> revision 5)
>>>
>>> 3658
>>>
>>>           std::max(params.getMaxLinkUsage(), link_usage_fraction));
>>>
>>>   Won't this crash if we have set maxLinkUsage to Error in line 3645?
>>>
>>>
>>> .
>>>
>>> - Bernd Mathiske
>>>
>>> On February 27th, 2015, 2:48 p.m. PST, Ritwik Yadav wrote:
>>>   Review request for mesos, Ben Mahler, Dominic Hamon, and Vinod Kone.
>>> By Ritwik Yadav.
>>>
>>> *Updated Feb. 27, 2015, 2:48 p.m.*
>>>  *Bugs: * MESOS-391 <https://issues.apache.org/jira/browse/MESOS-391>
>>>  *Repository: * mesos
>>> Description
>>>
>>> A fix to enable the slave garbage collector to take the number of hard links (st_nlinks) into account when creating new directories.
>>>
>>>   Testing
>>>
>>> Mesos builds sucessfully and all existing tests pass.
>>>
>>>   Diffs
>>>
>>>    - src/slave/constants.hpp (fd1c1aba0aa62372ab399bee5709ce81b8e92cec)
>>>    - src/slave/constants.cpp (2a99b1105af0e2d62b956fa96988f477937a46bd)
>>>    - src/slave/flags.hpp (56b25caf3901b38bdecb50310e8bcae0b114efa8)
>>>    - src/slave/paths.hpp (1618439d728ded347ec75317ce8dd998acd7ee94)
>>>    - src/slave/paths.cpp (01ea856aa2e628d4aee5fd31f7e49d147f740e8f)
>>>    - src/slave/slave.hpp (7b58cade9ada5f0a56b4c60c502eb907ccae33b4)
>>>    - src/slave/slave.cpp (9f31fa46304398e8f87b41b55d8f4cfd4aba10b9)
>>>    - src/tests/gc_tests.cpp (deaa6b1b6c32ae6d153229248d7d4f57caa0ebcf)
>>>
>>> View Diff <https://reviews.apache.org/r/30014/diff/>
>>>
>>
>>
>
>
> --
> *Ritwik Yadav*
>
> Department of Computer Science and Engineering,
> Indian Institute of Technology,
> Kharagpur.
>
> Cell: +91-9635-152346
> Twitter: @iRitwik <https://twitter.com/#%21/iRitwik>
>

Re: Review Request 30014: Enable the slave garbage collector to take the number of links into account.

Posted by Ritwik <ri...@gmail.com>.
Hi Ben,

I had already addressed all the comments. I was trying to figure out a way
to reproduce the issue with a failing test before posting a new diff. I am
sorry for the delay since it took me some time to figure out concepts like
containerization etc. which were used in the test corresponding to disk
usage.

I understand the point you are trying to make. However, this contradicts a
statement which you made earlier. If we want the number of directories to
affect the garbage collection "locally" then the only way is to re-order
the sequence of directories which were scheduled for GC by making their
delay a function of the number of executor directories. I will post a
detailed solution (if I am able to figure out one) on the ticket before
delving into the code next time.

In the meanwhile, I will take up simpler issues to solve.

Thanks for all the help and your time.

Best Regards,

On 4 March 2015 at 08:06, Benjamin Mahler <be...@gmail.com> wrote:

> Hey Ritwik, please hold off dealing with review comments. The current
> approach here still seems to be not quite what we want.
>
> In the current patch, you're speeding up garbage collection "globally"
> when an individual directory has too many child directories. This approach
> works fine with "disk usage" as it is a global constraint we must deal
> with. However, the "link count" is a per-directory constraint. This means
> we want to be able to purge on a per-directory basis rather than globally.
>
> I apologize that I haven't had time to provide specific implementation
> guidance and it's looking like this isn't a "newbie" ticket after all.
> Bernd is pitching some time in here to help. In the meantime you might want
> to pick up some other tickets that will require less shepherding from us!
>
> On Sat, Feb 28, 2015 at 4:54 AM, Bernd Mathiske <be...@mesosphere.io>
> wrote:
>
>>    This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/30014/
>>
>> First pass.
>>
>>
>>    src/slave/constants.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880941#file880941line64> (Diff
>> revision 5)
>>
>> extern const Duration REGISTRATION_BACKOFF_FACTOR;
>>
>>    64
>>
>> extern const double HARD_LINKS_THRESHOLD;
>>
>>   The other constants have a comment. Why not this one?
>>
>>
>>    src/slave/flags.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line158> (Diff
>> revision 5)
>>
>> public:
>>
>>    158
>>
>>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>>
>>   Danger: is "-" left-associative here or right-associative? Better not to make any assumptions and put extra parentheses.
>>
>> It seems to me that the first expression is meant to be left-associative and the second right-associative. See more discussion why this is so below.
>>
>> Where is max_framework_link_usage defined? This comes out of context here.
>>
>>
>>    src/slave/flags.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line170> (Diff
>> revision 5)
>>
>> public:
>>
>>    170
>>
>>         "    0.0, min((1.0 - gc_disk_headroom - disk usage),\n"
>>
>>   Same as above.
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line237> (Diff
>> revision 5)
>>
>> public:
>>
>>   236
>>
>>   // Garbage collects the directories based on the current disk usage.
>>
>> 237
>>
>>   // Garbage collects the directories based on the current disk usage and number
>>
>>   It is probably worth mentioning here that you mean the maximum current number of executor directories  in any active framework.
>>
>> While you are changing this sentence here, could you maybe also improve on the phrasing "the directories"? Using the definite article "the" without referring to anything in context makes it unclear to the reader what directories are meant.
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line242> (Diff
>> revision 5)
>>
>> public:
>>
>>   240
>>
>>   void _checkDiskUsage(const process::Future<double>& usage);
>>
>> 242
>>
>>   void _checkGCParameters(const process::Future<GCParameters>& gc_params);
>>
>>
>>    1.
>>
>>    We don't use snake case, we use camel case for variable naming. Please also correct all other variables in this review.
>>    2.
>>
>>    No need to abbreviate "parameters" to "params".
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line639> (Diff
>> revision 5)
>>
>> 639
>>
>>   // Getters
>>
>>   Redundant comment.
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line640> (Diff
>> revision 5)
>>
>> 640
>>
>>   double getDiskUsage() const;
>>
>>   Why use getters and setters here at all? What is the benefit if all the setters are public?
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line643> (Diff
>> revision 5)
>>
>> 643
>>
>>   // Setters
>>
>>   Redundant comment.
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line648> (Diff
>> revision 5)
>>
>> 648
>>
>>   // 'disk_usage' denotes a fraction of the disk that is currently being used
>>
>>   "a fraction"? What other fractions are there? Do you mean "the fraction"?
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line649> (Diff
>> revision 5)
>>
>> 649
>>
>>   // by the slave process. This usage is defined as a fraction of a limit
>>
>>   "This usage is defined as a fraction of a limit specified by a flag."
>>
>> This is too nebulous. Suggestion: reference/point to a place with the actual definition.
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line652> (Diff
>> revision 5)
>>
>> 652
>>
>>   // expressed as a fraction of the limit on the maximum number of
>>
>>   "the maximum number of subdirectories"
>>
>> Please add " in any given parent directory" to be extra clear. Otherwise one could possibly read this mistakenly as:
>>
>> "the maximum number of subdirectories anywhere in a disk volume".
>>
>>
>>    src/slave/slave.hpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line656> (Diff
>> revision 5)
>>
>> 656
>>
>>   Try<double> disk_usage, max_link_usage;
>>
>>
>>    1.
>>
>>    Please declare these two members separately, with separate comments.
>>    2.
>>
>>    "max_link_usage"? This sounds like a limit, not like a current gauge which it is. Suggestion: "topLinkUsage".
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3596> (Diff
>> revision 5)
>>
>> void Slave::registerExecutorTimeout(
>>
>>    3596
>>
>>       0.0, std::min((1.0 - flags.gc_disk_headroom - gc_params.getDiskUsage()),
>>
>>   The indentation in the next line makes this hard to read. I'd say line this up this way:
>>
>> return flags.gc_delay * std::max(
>>       0.0, std::min((1.0 - ...),
>>                     (1.0 - ...;
>>
>> Even better: use variables with good descriptive naming for intermediate expression results. Or break out the formluae into small functions, even if they do not occur in multiple places.
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3597> (Diff
>> revision 5)
>>
>> void Slave::registerExecutorTimeout(
>>
>>    3597
>>
>>           (1.0 - flags.gc_disk_headroom - gc_params.getMaxLinkUsage())));
>>
>>
>>    1. This can't be right. Subtracting link usage from disk head room?
>>
>> Such a glitch is exactly why I would like to suggest writing a unit test later on. Otherwise we may have this feature implemented, but it may not be working at all.
>>
>>    1. Left or right-associative use of "-" intended here? Suggestion: use extra parentheses! It looks like left-associative is correct here, but it won't be once you have replaced gc_disk_headroom with hard_links_threshold (hardLinksThreshold)... Then you have
>>
>> (1.0 - 0.8) - maxLinkUsage // seems wrong
>>
>> Proposal: Make these definitions more parallel in meaning. Use a headroom concept for hard links as well or change disk headroom to a disk usage threshold. I'd opt for the former, since it would touch less code.
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3637> (Diff
>> revision 5)
>>
>> 3637
>>
>>   GCParameters params;
>>
>>   params -> parameters
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3638> (Diff
>> revision 5)
>>
>> 3638
>>
>>   params.setDiskUsage(fs::usage(flags.work_dir));
>>
>>   insert blank line above and below this line
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3640> (Diff
>> revision 5)
>>
>> 3640
>>
>>     std::string executors_parent_path = paths::getExecutorsParentPath(
>>
>>   std::string -> string
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3642> (Diff
>> revision 5)
>>
>> 3642
>>
>>     struct stat s;
>>
>>   insert blank line above this one and swap the two declarations, because lstat uses s, not limit.
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3645> (Diff
>> revision 5)
>>
>> 3645
>>
>>       params.setMaxLinkUsage(ErrnoError(
>>
>>   Hm. Is it really the best strategy to report nothing at all just because we are unsure about one single framweork?
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3650> (Diff
>> revision 5)
>>
>> 3650
>>
>>       limit = ::pathconf(executors_parent_path.c_str(), _PC_LINK_MAX);
>>
>>   Why not declare "limit" here? Why declare it in an outer scope and needlessly initialize it there?
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3651> (Diff
>> revision 5)
>>
>> 3651
>>
>>       if (limit < 0 && errno != 0) {
>>
>>   What about "limit < 0" and "errno == 0"? This is a possible response from pathconf. Then you keep going, dividing by zero in line 3656.
>>
>> Did you mean this?
>>
>> "limit < 0 || errno != 0"
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3656> (Diff
>> revision 5)
>>
>> 3656
>>
>>       double link_usage_fraction = 1.0 * s.st_nlink / limit;
>>
>>   What if we are on a system where the limit gets ignored and mkdir still works and GC for some reason has not caught up yet? Do we get usages above 1.0 then? Is this expected and handled well in therest of the code?
>>
>>
>>    src/slave/slave.cpp
>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3658> (Diff
>> revision 5)
>>
>> 3658
>>
>>           std::max(params.getMaxLinkUsage(), link_usage_fraction));
>>
>>   Won't this crash if we have set maxLinkUsage to Error in line 3645?
>>
>>
>> .
>>
>> - Bernd Mathiske
>>
>> On February 27th, 2015, 2:48 p.m. PST, Ritwik Yadav wrote:
>>   Review request for mesos, Ben Mahler, Dominic Hamon, and Vinod Kone.
>> By Ritwik Yadav.
>>
>> *Updated Feb. 27, 2015, 2:48 p.m.*
>>  *Bugs: * MESOS-391 <https://issues.apache.org/jira/browse/MESOS-391>
>>  *Repository: * mesos
>> Description
>>
>> A fix to enable the slave garbage collector to take the number of hard links (st_nlinks) into account when creating new directories.
>>
>>   Testing
>>
>> Mesos builds sucessfully and all existing tests pass.
>>
>>   Diffs
>>
>>    - src/slave/constants.hpp (fd1c1aba0aa62372ab399bee5709ce81b8e92cec)
>>    - src/slave/constants.cpp (2a99b1105af0e2d62b956fa96988f477937a46bd)
>>    - src/slave/flags.hpp (56b25caf3901b38bdecb50310e8bcae0b114efa8)
>>    - src/slave/paths.hpp (1618439d728ded347ec75317ce8dd998acd7ee94)
>>    - src/slave/paths.cpp (01ea856aa2e628d4aee5fd31f7e49d147f740e8f)
>>    - src/slave/slave.hpp (7b58cade9ada5f0a56b4c60c502eb907ccae33b4)
>>    - src/slave/slave.cpp (9f31fa46304398e8f87b41b55d8f4cfd4aba10b9)
>>    - src/tests/gc_tests.cpp (deaa6b1b6c32ae6d153229248d7d4f57caa0ebcf)
>>
>> View Diff <https://reviews.apache.org/r/30014/diff/>
>>
>
>


-- 
*Ritwik Yadav*

Department of Computer Science and Engineering,
Indian Institute of Technology,
Kharagpur.

Cell: +91-9635-152346
Twitter: @iRitwik <https://twitter.com/#%21/iRitwik>