You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@buildstream.apache.org by Tristan Van Berkom <tr...@codethink.co.uk> on 2020/09/22 07:29:38 UTC

Proposal: Remove Source.BST_KEY_REQUIRES_STAGE

Hi all,

Today, while enabling some extra logging paths, I noticed this new API
which is... quite confusing.

It looks like this is the tail end of a workstream essentially focused
on staging workspaces through CAS (if I'm not mistaken), I found this
was initially introduced via:

  https://gitlab.com/BuildStream/buildstream/-/issues/1161
  https://gitlab.com/BuildStream/buildstream/-/merge_requests/1651

If I understand correctly, the presence of buildbox-casd makes it
possible for recursive directory checksum calculations to be optimized
when performed repeatedly (otherwise I don't understand the motivation
to replace the checksumming with CAS digests here).

In any case, this has resulted in the following API on Source:

    BST_KEY_REQUIRES_STAGE = False
    """Whether the source will require staging in order to efficiently
       generate a unique key.
    """

This docstring raises some concern to me.


This is extremely confusing and under documented
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Plugins which set this flag do not implement Plugin.get_unique_key(),
which is a bit of a contradiction.

  * All plugins are expected to implement `get_unique_key()`, there is
    no exception to this explained in detail in the `get_unique_key()`
    documentation.

  * The documentation of this flag does not say anything about how
    the unique key will be resolved.

It turns out, the core does some side stepping of `get_unique_key()` in
the case this flag is specified, and unconditionally stages the source
into CAS and returns the digest from `Source._stage_into_cas()`.


This is a bad regression waiting to happen
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reading the docstring in question sends a strange message to Source
plugin developers, which is that Sources are allowed to have their
payload present locally in order to derive a cache key.

The above would be a wildly incorrect presumption, as we know well, a
BuildStream project can only ever lack cache keys when sources have
unresolved refs (and they appear as "no reference" in `bst show`, so we
need to manually set them or `bst source track` them).

I think this is a precious invariant we don't want to risk losing.


Added cognitive complexity in source.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is a large cost in terms of cognitive complexity around the
Source.BST_KEY_REQUIRES_STAGE flag.

There is a functional value to preserve here, but the complexity of
special casing this flag is not necessary to achieve this
functionality.

Let's list the branch statements below to prove the point:

  * We don't call `Source.track()` if it's set.

    However, local sources and workspaces can continue to use a no-op
    here, that is more clear.

  * We add `Source._is_trackable()` for this, just to allow skipping
    running a scheduler job for a source where there will be a no-op.

  * We treat the source specially in `Source._track()` for the case
    that it happens to be called (amongst other element sources), just
    to ensure it has a cache key.

All of the above seems completely pointless, just implement plugin
methods with noops and be done with it, it's better to have the core
treat the plugins in an unbiased fashion than to add switches.

The remaining points:

  * We leverage the CAS to calculate `Plugin.get_unique_key()`

  * We treat it specially at stage time, i.e. we stage within
    CAS because we already staged it at cache key calculation
    time.


Effectively, we half the checksumming for all local data with
this, because we would otherwise sha256sum() the files on disk
only to later add them to the CAS where they will be checksummed again.

Lets deal with only this valuable bit in the proposal below...


Proposed changes
~~~~~~~~~~~~~~~~
The aforementioned functionality should be achievable by the core
offering two utility functions to source plugins (these would be Source
methods, so they would have access to internals like Context, the CAS,
etc).

    def cache_file_or_directory(path: str) -> str:
        """Caches a file or directory in the BuildStream cache and
           returns the digest of the cached directory.
        """

    def stage_cached_directory(sandbox: Sandbox, digest: str, directory: str):
        """Stage a previously cached directory into the sandbox.
        """

In this way, we should be able to only use these in the workspace
mechanics and in the local plugin, remove the documented flag on
Source, remove all the special casing from source.py, and let the core
operate on these sources in an unbiased fashion, removing all the
additionally incurred cognitive complexity added by this flag.

To sweeten the pot further, we could even make these private, since
they are only used by core workspace mechanics and the core "local"
plugin.

Thoughts ?

Cheers,
    -Tristan



Re: Proposal: Remove Source.BST_KEY_REQUIRES_STAGE

Posted by Jürg Billeter <j...@bitron.ch>.
Hi Tristan,

On Tue, 2020-09-22 at 16:29 +0900, Tristan Van Berkom wrote:
> If I understand correctly, the presence of buildbox-casd makes it
> possible for recursive directory checksum calculations to be optimized
> when performed repeatedly (otherwise I don't understand the motivation
> to replace the checksumming with CAS digests here).

Yes, the plan is to add this optimization to buildbox-casd where it
will also be useful to other clients, however, this isn't implemented
yet: https://gitlab.com/BuildGrid/buildbox/buildbox-casd/-/issues/33

This will allow skipping checksum calculation for unmodified files,
which is crucial for workspace performance.

> This is a bad regression waiting to happen
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Reading the docstring in question sends a strange message to Source
> plugin developers, which is that Sources are allowed to have their
> payload present locally in order to derive a cache key.

This is the case for `local` in-project sources. However, sources
must indeed never use local files outside the project (special
exception for workspaces) and, ideally, also in-project sources are not
used beyond the core `local` source plugin.

> Added cognitive complexity in source.py
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> There is a large cost in terms of cognitive complexity around the
> Source.BST_KEY_REQUIRES_STAGE flag.
> 
> There is a functional value to preserve here, but the complexity of
> special casing this flag is not necessary to achieve this
> functionality.
> 
> Let's list the branch statements below to prove the point:
> 
>   * We don't call `Source.track()` if it's set.
> 
>     However, local sources and workspaces can continue to use a no-op
>     here, that is more clear.
> 
>   * We add `Source._is_trackable()` for this, just to allow skipping
>     running a scheduler job for a source where there will be a no-op.
> 
>   * We treat the source specially in `Source._track()` for the case
>     that it happens to be called (amongst other element sources), just
>     to ensure it has a cache key.
> 
> All of the above seems completely pointless, just implement plugin
> methods with noops and be done with it, it's better to have the core
> treat the plugins in an unbiased fashion than to add switches.

Agreed, I can't think of a reason for these special cases to exist.

> 
> The remaining points:
> 
>   * We leverage the CAS to calculate `Plugin.get_unique_key()`
> 
>   * We treat it specially at stage time, i.e. we stage within
>     CAS because we already staged it at cache key calculation
>     time.
> 
> 
> Effectively, we half the checksumming for all local data with
> this, because we would otherwise sha256sum() the files on disk
> only to later add them to the CAS where they will be checksummed
> again.

With the planned buildbox-casd optimization, the biggest benefit will
be not having to checksum unmodified local/workspace files on repeated
BuildStream invocations. This should result in a massive reduction of
session initialization time when workspaces are used.

> Proposed changes
> ~~~~~~~~~~~~~~~~
> The aforementioned functionality should be achievable by the core
> offering two utility functions to source plugins (these would be
> Source
> methods, so they would have access to internals like Context, the
> CAS,
> etc).
> 
>     def cache_file_or_directory(path: str) -> str:
>         """Caches a file or directory in the BuildStream cache and
>            returns the digest of the cached directory.
>         """
> 
>     def stage_cached_directory(sandbox: Sandbox, digest: str,
> directory: str):
>         """Stage a previously cached directory into the sandbox.
>         """
> 
> In this way, we should be able to only use these in the workspace
> mechanics and in the local plugin, remove the documented flag on
> Source, remove all the special casing from source.py, and let the
> core
> operate on these sources in an unbiased fashion, removing all the
> additionally incurred cognitive complexity added by this flag.
> 
> To sweeten the pot further, we could even make these private, since
> they are only used by core workspace mechanics and the core "local"
> plugin.

Overall, this sounds sensible to me. A couple of points:

 * We don't currently expose CAS digests at all in the plugin API, as
   far as I know. Also, CAS digests are protobuf messages, not strings
   (although representation as string would be possible, of course).
   Due to this I'd rather not add the above methods to the public API,
   however, adding these as internal methods would be fine with me
   (possibly switching from `str` to `Digest`).
 * It's not decided yet whether the optimization in buildbox-casd will
   be completely transparent to the client or e.g. require the digest
   of the previous import. Remaining flexible here for now is another
   reason to keep these methods out of the public API.
 * As `get_unique_key()` may get relatively expensive, we need to make
   sure it's only ever called once per session, which sounds like a
   good invariant in any case.

Cheers,
Jürg