You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Benjamin Hindman <be...@berkeley.edu> on 2015/06/01 15:39:52 UTC

Re: Review Request 30774: Fetcher Cache

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30774/#review84857
-----------------------------------------------------------

Ship it!



src/slave/containerizer/fetcher.hpp
<https://reviews.apache.org/r/30774/#comment136275>

    This looks like a good candidate for inlining too.



src/slave/containerizer/fetcher.hpp
<https://reviews.apache.org/r/30774/#comment136270>

    Dead code?



src/slave/containerizer/fetcher.hpp
<https://reviews.apache.org/r/30774/#comment136271>

    Dead code?



src/slave/containerizer/fetcher.hpp
<https://reviews.apache.org/r/30774/#comment136272>

    These two can easily and cleanly be inlined via lambdas.



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137788>

    A good candidate for a C++11 lambda.



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137789>

    Another candidate for a C++11 lamdba!



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment136273>

    Why the body of this is not inlined should be commented here otherwise someone is likely wondering what was the reason for not inlining `_fetch` here.



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137790>

    Why is os::chown no longer done any more? If we don't need to chown, why? And does that mean we can clean up this code considerably as the comment suggests?



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137792>

    What about closing 'out' and 'err' here now?



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137793>

    And need to close 'out' and 'err' here too.



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment136274>

    This comment needs updating, looks like when you removed `__run` you forgot to do a global search for all instances of `__run`?



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137794>

    CHECK_EQ(space, bytes);



src/slave/containerizer/fetcher.cpp
<https://reviews.apache.org/r/30774/#comment137795>

    Is this really possibly anymore given the current design where 'adjust' will fail if the size we downloaded was bigger than what we first determined via 'fetchSize'?



src/tests/fetcher_cache_tests.cpp
<https://reviews.apache.org/r/30774/#comment137797>

    Please put '{' on newline and this could really use a comment!



src/tests/fetcher_cache_tests.cpp
<https://reviews.apache.org/r/30774/#comment137805>

    I'd like us to elaborate the comment here and add a TODO that we want a generic HTTP server for use in tests that has functionality like pausing requests. I'm not a huge fan of the half-actor half not strategy here, it's not something we want others to replicate and we should explicitly call that out so we don't get more broken windows.



src/tests/fetcher_cache_tests.cpp
<https://reviews.apache.org/r/30774/#comment137803>

    This might break if someone has a machine configured where the default IP is not 127.0.0.1, but everything should stringify correctly for you anyway:
    
    return "http://" + stringify(self().address) + "/" + self().id + "/";



src/tests/fetcher_cache_tests.cpp
<https://reviews.apache.org/r/30774/#comment137806>

    Using locks here is dangerous because on a smaller machine depending on the number of requests we have come through a test we might actually deadlock the entire process. We should call this out explicitly and leave a TODO on how we can do this asynchronously.


- Benjamin Hindman


On May 21, 2015, 4:05 p.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30774/
> -----------------------------------------------------------
> 
> (Updated May 21, 2015, 4:05 p.m.)
> 
> 
> Review request for mesos, Adam B, Benjamin Hindman, Till Toenshoff, and Timothy Chen.
> 
> 
> Bugs: MESOS-2057, MESOS-2069, MESOS-2070, MESOS-2072, MESOS-2073, and MESOS-2074
>     https://issues.apache.org/jira/browse/MESOS-2057
>     https://issues.apache.org/jira/browse/MESOS-2069
>     https://issues.apache.org/jira/browse/MESOS-2070
>     https://issues.apache.org/jira/browse/MESOS-2072
>     https://issues.apache.org/jira/browse/MESOS-2073
>     https://issues.apache.org/jira/browse/MESOS-2074
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Almost all of the functionality in epic MESOS-336. Downloaded files from CommandInfo::URIs can now be cached in a cache directory designated by a slave flag. This only happens when asked for by an extra flag in the URI and is thus backwards-compatible. The cache has a size limit also given by a new slave flag. Cache-resident files are evicted as necessary to make space for newly fetched ones. Concurrent attempts to cache the same URI leads to only one download. The fetcher program remains external for safety reasons, but is now augmented with more elaborate parameters packed into a JSON object to implement specific fetch actions for all of the above. Additional testing includes fetching from (mock) HDFS and coverage of the new features.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 54c4e31ed6dfed3c23d492c19a301ce119a0519b 
>   docs/fetcher-cache-internals.md PRE-CREATION 
>   docs/fetcher.md PRE-CREATION 
>   include/mesos/fetcher/fetcher.proto 311af9aebc6a85dadba9dbeffcf7036b70896bcc 
>   include/mesos/mesos.proto 9cc5782256156ed59fd4640091413b76480d939f 
>   include/mesos/type_utils.hpp 837be6f1844d5fa01c0fd84a585e7ff2cc0c987b 
>   src/Makefile.am 34755cf795391c9b8051a5e4acc6caf844984496 
>   src/hdfs/hdfs.hpp 968545d9af896f3e72e156484cc58135405cef6b 
>   src/launcher/fetcher.cpp 796526f59c25898ef6db2b828b0e2bb7b172ba25 
>   src/slave/constants.hpp fd1c1aba0aa62372ab399bee5709ce81b8e92cec 
>   src/slave/containerizer/docker.hpp ed4ee19c85387882ab2e31baa5610acb8e222d50 
>   src/slave/containerizer/docker.cpp 408a4435a6f11973992486eac1659beeccc4beac 
>   src/slave/containerizer/fetcher.hpp 1db0eaf002c8d0eaf4e0391858e61e0912b35829 
>   src/slave/containerizer/fetcher.cpp 9e9e9d0eb6b0801d53dec3baea32a4cd4acdd5e2 
>   src/slave/containerizer/mesos/containerizer.hpp 3e18617b0dbac58176bfd41dc550898eb6a4a79e 
>   src/slave/containerizer/mesos/containerizer.cpp 696e359de66305512eedf8e269543fafa21f4bc3 
>   src/slave/flags.hpp 5c57478fcfdbcbd8ac0e5c3c79809403054e96e6 
>   src/slave/flags.cpp b5e25186dad36bc1306cc6ecb268aba951a18f7e 
>   src/slave/slave.cpp 8e88482f41f37ce7f2559fe793565b66ac46fb35 
>   src/tests/docker_containerizer_tests.cpp 154bf981c007ebcb8e0b2fe8551defb5ea2ba063 
>   src/tests/fetcher_cache_tests.cpp PRE-CREATION 
>   src/tests/fetcher_tests.cpp 4549e6a631e2c17cec3766efaa556593eeac9a1e 
>   src/tests/mesos.hpp a60df75350beab8d7091cbe66213ecd920942fa4 
>   src/tests/mesos.cpp 1d5639c85517229f3396b40f2d8bd421b2ed7325 
> 
> Diff: https://reviews.apache.org/r/30774/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> --- longer Description: ---
> 
> -Replaces all other reviews for the fetcher cache except those related to stout: 30006, 30033, 30034, 30036, 30037, 30039, 30124, 30173, 30614, 30616, 30618, 30621, 30626. See descriptions of those. In dependency order:
> 
> 30033: Removes the fetcher env tests since these won't be needed any more when the fetcher uses JSON in a single env var as a parameter. They never tested anything that won't be covered by other tests anyway.
> 
> 30034: Makes the code structure of all fetcher tests the same. Instead of calling the run method of the fetcher directly, calling through fetch(). Also removes all uses of I/O redirection, which is not really needed for debugging, and thus the next patch can refactor fetch() and run(). (The latter comes in two varieties, which complicates matters without much benefit.)
> 
> 30036: Extends the CommandInfo::URI protobuf with a boolean "caching" field that will later cause fetcher cache actions. Also introduces the notion of a cache directory to the fetcher info protobuf. And then propagates these additions throughout the rest of the code base where applicable. This includes passing the slave ID all the way down to the place where the cache dir name is constructed.
> 
> 30037: Extends the fetcher info protobuf with "actions" (fetch directly bypassing the cache, fetch through the cache, retrieve from the cache). Switches the basis for dealing with uris to "items", which contain the uri, the action, and potentially a cache file name. Refactors fetch() and run(), so there is only one of each. Introduces about half of the actual cache logic, including a hashmap of cache file objects for bookkeeping and basic operations on it. 
> 
> 30039: Enables fetcher cache actions in the mesos fetcher program.
> 
> 30006: Enables concurrent downloading into the fetcher cache. Reuse of download results in the cache when multiple fetcher runs occur concurrently. 
> 
> 30614: This is to ensure that all this refactoring of fetcher code has not broken HDFS fetching. Adds a test that exercises the C++ code paths in Mesos and mesos-fetcher related to fetching from HDFS. Uses a mock HDFS client written in bash that acts just like a real "hadoop" command if used in the right limited way.
> 
> 30124: Inserted fetcher cache zap upon slave startup, recovery and shutdown. This implements recovery in an acceptable, yet most simple way.
> 
> 30173: Created fetcher cache tests. Adds a new test source file containing a test fixture and tests to find out if the fetcher cache works with a variety of settings.
> 
> 30616: Adds hdfs::du() which calls "hadoop fs -du -h" and returns a string that contains the file size for the URI passed as argument. This is needed to determine the size of a file on HDFS before downloading it to the fetcher cache (to ensure there is enough space).
> 
> 30621: Refactored URI type separation in mesos-fetcher. Moved the URI type separation code (distinguishes http, hdfs, local copying, etc.) from mesos-fetcher to the fetcher process/actor, since it is going to be reused by download size queries when we introduce fetcher cache management. Also factored out URI validation, which will be used the same way by mesos-fetcher and the fetcher process/actor.
> 
> 30626: Fetcher cache eviction. This happens when the cache does not have enough space to accomodate upcoming downloads to the cache. Necessary provisions included here:
> - mesos-fetcher does not run until evictions have been successful
> - Cache space is reserved while (async) waiting for eviction to succeed. If it fails, the reservation gets undone.
> - Reservations can be partly from available space, partly from evictions. All math included :-)
> - To find out how much space is needed, downloading has a prelude in which we query the download size from the URI. This works for all URI types that mesos-fetcher currently supports, including http and hdfs.
> - Size-determination requests are now synchronized, too. Only one per URI in play happens.
> - There is cleanup code for all kinds of error situations. At the very end of the fetch attempt, each list is processed for undoing things like space reservations and eviction disabling.
> - Eviction gets disabled for URIs that are currently in use, i.e. the related cache files are. We use reference counting for this, since there may be concurrent fetch attempts using the same cache files.
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>