Posted to dev@trafficserver.apache.org by Igor Galić <i....@brainsware.org> on 2013/11/05 22:39:31 UTC

Re: git commit: Doc: cache architecture updates (dir_probe explained)


----- Original Message -----
> Updated Branches:
>   refs/heads/master ebb5e305b -> 0b984c45a
> 
> 
> Doc: cache architecture updates (dir_probe explained)
> 
> 
> Project: http://git-wip-us.apache.org/repos/asf/trafficserver/repo
> Commit: http://git-wip-us.apache.org/repos/asf/trafficserver/commit/0b984c45
> Tree: http://git-wip-us.apache.org/repos/asf/trafficserver/tree/0b984c45
> Diff: http://git-wip-us.apache.org/repos/asf/trafficserver/diff/0b984c45
> 
> Branch: refs/heads/master
> Commit: 0b984c45aa104134d4f9f1da9f6b185b0f452324
> Parents: ebb5e30
> Author: Alan M. Carroll <am...@network-geographics.com>
> Authored: Fri Nov 1 19:42:42 2013 -0500
> Committer: Alan M. Carroll <am...@network-geographics.com>
> Committed: Fri Nov 1 19:42:42 2013 -0500
> 
> ----------------------------------------------------------------------
>  doc/arch/cache/cache-arch.en.rst | 135 ++++++++++++++++++++++++++--------
>  1 file changed, 106 insertions(+), 29 deletions(-)
> ----------------------------------------------------------------------
> 
> 
> http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0b984c45/doc/arch/cache/cache-arch.en.rst
> ----------------------------------------------------------------------
> diff --git a/doc/arch/cache/cache-arch.en.rst
> b/doc/arch/cache/cache-arch.en.rst
> index 1eb87e3..179e5a4 100755
> --- a/doc/arch/cache/cache-arch.en.rst
> +++ b/doc/arch/cache/cache-arch.en.rst
> @@ -109,8 +109,8 @@ represented by :cpp:class:`Dir`. Each entry refers to a
> chunk of contiguous stor
>  variously as "fragments", "segments", "docs" / "documents", and a few other
>  things. This document will use the term
>  "fragment" as that is the most common reference in the code. The term "Doc"
>  (for :cpp:class:`Doc`) will be used to refer
>  to the header data for a fragment. Overall the directory is treated as a
>  hash with a "cache ID" as the key. A cache ID
> -is a 128 value generated in various ways depending on context. This key is
> reduced and used as an index in to the
> -directory to locate an entry in the standard way.
> +is a 128 bit (16 byte) value generated in various ways depending on context.
> This ID is reduced and used as an index in
> +to the directory to locate an entry in the standard way.


"and used as an index in to the directory"

is this supposed to be "into" or are we missing a word here?
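
For reference, the dir_probe section further down in this same patch spells
out the reduction; a quick sketch of my reading (stand-in types, not the
real ATS ones):

    #include <cstdint>

    struct CacheId   { uint64_t high, low; };  // the 128 bit cache ID
    struct DirLayout { uint64_t segments, buckets_per_segment; };

    // "reduced and used as an index": the top half picks the segment,
    // the bottom half picks the bucket, each taken modulo the counts.
    uint64_t segment_index(CacheId const& id, DirLayout const& d) {
      return id.high % d.segments;
    }
    uint64_t bucket_index(CacheId const& id, DirLayout const& d) {
      return id.low % d.buckets_per_segment;
    }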

>  The directory is used as a memory resident structure which means a directory
>  entry is as small as possible (currently 10
>  bytes). This forces some compromises on the data that can be stored there.
>  On the other hand this means that most cache
> @@ -125,15 +125,17 @@ there is enough memory to run |TS| with an empty cache
> there is enough to run it
>  .. figure:: images/cache-directory-structure.png
>     :align: center
>  
> -Each entry stores the cache ID as the key, along with an offset in the
> stripe and a size. The size stored in the
> -directory entry is an :ref:`approximate size <dir-size>` which is at least
> as big as the actual data. Exact size data is
> -stored in the fragment header on disk.
> +Each entry stores an offset in the stripe and a size. The size stored in the
> directory entry is an :ref:`approximate
> +size <dir-size>` which is at least as big as the actual data. Exact size
> data is stored in the fragment header on disk.
>  
>  .. note::
>  
>     Data in HTTP headers cannot be examined without disk I/O. This includes
>     the original URL for the object. The original
>     source of the cache ID is not stored anywhere.
>  
> +.. _dir-segment:
> +.. _dir-bucket:
> +
>  The entries in a directory are grouped. The first level grouping is a
>  *bucket*. This is a fixed number (currently 4 -
>  defined as ``DIR_DEPTH``) of entries. The index generated from a cache ID is
>  used as a bucket index (not an entry index).
>  Buckets are grouped in to *segments*. All segments in a stripe have the same
>  number of buckets. The number of segments
> @@ -146,7 +148,12 @@ entries in a segment. Note that all segments in the same
> stripe will have the sa
>  Each entry has a previous and next index value which is used to link the
>  entries in the same segment. The index size is
>  16 bits which suffices to index any entry in the same segment. The stripe
>  header contains an array of entry indices
>  which are used as the roots of a free list. When a stripe is initialized the
>  first entry in each bucket is zeroed
> -(marked unused) and all other entries are put in the corresponding segment
> free list rooted in the stripe header.
> +(marked unused) and all other entries are put in the corresponding segment
> free list rooted in the stripe header. In
> +essence the first element of each fixed bucket is used as a root for that
> bucket. The other entries in the fixed bucker

-fixed bucker
+fixed bucket

> +are preferentially preferred for adding to that bucket but this is not

"preferentially preferred" -- what does that even mean?

> required. The segment free lists are initialized
> +such that the extra bucket entries are added in order - all the seconds,
> then the thirds, then the fourths. Because the
> +free lists are FIFOs this means extra entries will be selected from the
> fourth entries across all the buckets first,
> +then the thirds, etc. This maximizes locality for bucket searching.

I'd just like to add that this is confusing even on a second reading, but
not on a third.
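
To make it concrete, here is the initialization order as I read it,
sketched with stand-in containers rather than the real free list links:

    #include <cstdint>
    #include <vector>

    constexpr int DIR_DEPTH = 4;  // entries per bucket

    // Push the non-root entries of every bucket in depth order: all the
    // seconds, then all the thirds, then all the fourths.
    std::vector<uint16_t> init_free_list(int bucket_count) {
      std::vector<uint16_t> free_list;
      for (int depth = 1; depth < DIR_DEPTH; ++depth)
        for (int b = 0; b < bucket_count; ++b)
          free_list.push_back(uint16_t(b * DIR_DEPTH + depth));
      // entry 0 of each bucket is the bucket root and is simply zeroed
      return free_list;
    }

    // Allocation consumes from the end just filled, so the fourth entries
    // across all buckets come out first, then the thirds, then the seconds.

Note that the selection order described (fourths first) is what you get
from a list consumed from the same end it was filled from, i.e. a stack,
despite the text calling the free lists FIFOs.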

  
>  .. figure:: images/dir-bucket-assign.png
>     :align: center
> @@ -290,7 +297,12 @@ computed by::
>  
>  where ``next_key`` is a global function that deterministically computes a
>  new cache key from an existing cache key.
>  
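
Since ``next_key`` is deterministic, the whole chain of fragment keys can
be replayed from the first key alone. Roughly (the signature here is my
guess; only the behavior is from the text):

    struct CacheKey { uint64_t w[2]; };    // stand-in for the 128 bit key
    CacheKey next_key(CacheKey const&);    // assumed: the global function

    // Key of fragment n, recomputed from the object's first key.
    CacheKey fragment_key(CacheKey k, int n) {
      for (int i = 0; i < n; ++i)
        k = next_key(k);
      return k;
    }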
> -Objects with multiple fragments are laid out such that the data fragments
> (including the earliest ``Doc``) are written first and the first ``Doc`` is
> written last. When read from disk, both the first and earliest ``Doc`` are
> validated (tested to ensure that they haven't been overwritten by the write
> cursor) to verify that the entire document is present on disk (as they
> bookend the other fragments - the write cursor cannot overwrite them without
> overwriting at leastone of the verified ``Doc`` instances). Note that while

leastone -> least one

> the fragments of a single object are ordered they are not necessarily
> contiguous as data from different objects are interleaved as the data
> arrives in |TS|.
> +Objects with multiple fragments are laid out such that the data fragments
> (including the earliest ``Doc``) are written
> +first and the first ``Doc`` is written last. When read from disk, both the
> first and earliest ``Doc`` are validated
> +(tested to ensure that they haven't been overwritten by the write cursor) to
> verify that the entire document is present
> +on disk (as they bookend the other fragments - the write cursor cannot
> overwrite them without overwriting at leastone of
> +the verified ``Doc`` instances). Note that while the fragments of a single
> object are ordered they are not necessarily
> +contiguous as data from different objects are interleaved as the data
> arrives in |TS|.
>  
>  .. figure:: images/cache-multi-fragment.png
>     :align: center
> @@ -320,11 +332,21 @@ Some general observations on the data structures.
>  Cyclone buffer
>  --------------
>  
> -Because the cache is a cyclone cache objects are not preserved for an
> indefinite time. Even if the object is not stale it can be overwritten as
> the cache cycles through its volume. Marking an object as ``pinned``
> preserves the object through the passage of the write cursor but this is
> done by copying the object across the gap, in effect re-storing it in the
> cache. Pinning large objects or a large number objects can lead to a
> excessive disk activity. The original purpose of pinning seems to have been
> for small, frequently used objects explicitly marked by the administrator.
> +Because the cache is a cyclone cache objects are not preserved for an
> indefinite time. Even if the object is not stale
> +it can be overwritten as the cache cycles through its volume. Marking an
> object as ``pinned`` preserves the object
> +through the passage of the write cursor but this is done by copying the
> object across the gap, in effect re-storing it
> +in the cache. Pinning large objects or a large number of objects can lead to
> excessive disk activity. The original
> +purpose of pinning seems to have been for small, frequently used objects
> explicitly marked by the administrator.
>  
> -This means the purpose of expiration data on objects is simply to prevent
> them from being served to clients. They are not in the standard sense
> deleted or cleaned up. The space can't be immediately reclaimed in any event
> because writing only happens at the write cursor. Deleting an object
> consists only of removing the directory entries in the volume directory
> which suffices to (eventually) free the space and render the document
> inaccessible.
> +This means the purpose of expiration data on objects is simply to prevent
> them from being served to clients. They are
> +not in the standard sense deleted or cleaned up. The space can't be
> immediately reclaimed in any event because writing
> +only happens at the write cursor. Deleting an object consists only of
> removing the directory entries in the volume
> +directory which suffices to (eventually) free the space and render the
> document inaccessible.
>  
> -Historically the cache is designed this way because web content was
> relatively small and not particularly consistent. The design also provides
> high performance and low consistency requirements. There are no
> fragmentation issues for the storage, and both cache misses and object
> deletions require no disk I/O. It does not deal particularly well with long
> term storage of large objects. See the :ref:`volume tagging` appendix for
> details on some work in this area.
> +Historically the cache is designed this way because web content was
> relatively small and not particularly consistent.
> +The design also provides high performance and imposes low consistency requirements.
> There are no fragmentation issues for the
> +storage, and both cache misses and object deletions require no disk I/O. It
> does not deal particularly well with long
> +term storage of large objects. See the :ref:`volume tagging` appendix for
> details on some work in this area.
>  
>  Disk Failure
>  ------------
> @@ -338,7 +360,7 @@ Restoring a disk to active duty is quite a bit more
> difficult task. Changing the
>  Implementation Details
>  ======================
>  
> -Volume Directory
> +Stripe Directory
>  ----------------
>  
>  .. _directory-entry:
> @@ -353,18 +375,18 @@ The in memory volume directory entries are defined as
> described below.
>     Name        Type                Use
>     =========== ===================
>     ===================================================
>     offset      unsigned int:24     Offset of first byte of metadata (volume
>     relative)
> -   big         unsigned in:2       Offset multiplier
> +   big         unsigned int:2      Size multiplier
>     size        unsigned int:6      Size
>     tag         unsigned int:12     Partial key (fast collision check)
>     phase       unsigned int:1      Unknown
> -   head        unsigned int:1      Flag: first segment in a document
> +   head        unsigned int:1      Flag: first fragment in an object
>     pinned      unsigned int:1      Flag: document is pinned
>     token       unsigned int:1      Flag: Unknown
>     next        unsigned int:16     Segment local index of next entry.
>     offset_high inku16              High order offset bits
>     =========== ===================
>     ===================================================
>  
> -   The volume directory is an array of ``Dir`` instances. Each entry refers
> to a span in the volume which contains a cached object. Because every object
> in the cache has at least one directory entry this data has been made as
> small as possible.
> +   The stripe directory is an array of ``Dir`` instances. Each entry refers
> to a span in the volume which contains a cached object. Because every object
> in the cache has at least one directory entry this data has been made as
> small as possible.
>  
>     The offset value is the starting byte of the object in the volume. It is
>     40 bits long split between the *offset* (lower 24 bits) and
>     *offset_high* (upper 16 bits) members. Note that since there is a
>     directory for every storage unit in a cache volume, this is the offset
>     in to the slice of a storage unit attached to that volume.
>  
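
As a sanity check on the "10 bytes" claim, the bit budget in the table
does add up to 80 bits. A bitfield rendering, illustrative only - the real
struct packs these into 16 bit words, and a plain bitfield like this may
be padded by the compiler:

    #include <cstdint>

    struct DirSketch {
      uint64_t offset : 24;  // low 24 bits of the byte offset in the stripe
      uint64_t big    : 2;   // size multiplier selector
      uint64_t size   : 6;   // approximate size, offset by one
      uint64_t tag    : 12;  // partial key for the fast collision check
      uint64_t phase  : 1;
      uint64_t head   : 1;   // first fragment in an object
      uint64_t pinned : 1;
      uint64_t token  : 1;
      uint64_t next   : 16;  // segment local index of the next entry
      uint16_t offset_high;  // high 16 bits of the 40 bit offset
    };
    // 24+2+6+12+1+1+1+1+16 = 64 bits, plus 16 more = 80 bits = 10 bytes.
    // full offset = (uint64_t(offset_high) << 24) | offset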
> @@ -384,14 +406,14 @@ The in memory volume directory entries are defined as
> described below.
>  
>     .. _big-mult:
>  
> -   ===== ===============   ===============
> +   ===== ===============   ========================
>     *big* Multiplier        Maximum Size
> -   ===== ===============   ===============
> +   ===== ===============   ========================
>       0   512 (2^9)         32768 (2^15)
>       1   4096 (2^12)       262144 (2^18)
>       2   32768 (2^15)      2097152 (2^21)
>       3   262144 (2^18)     16777216 (2^24)
> -   ===== ===============   ===============
> +   ===== ===============   ========================
>  
>     Note also that *size* is effectively offset by one, so a value of 0
>     indicates a single unit of the multiplier.
>  
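
The multiplier table collapses to a shift; my reading of it:

    #include <cstdint>

    // multiplier = 2^(9 + 3*big); the stored size is offset by one.
    uint64_t approx_size(unsigned big, unsigned size) {
      return (uint64_t(size) + 1) << (9 + 3 * big);
    }
    // approx_size(0, 0)  == 512       - one unit of the smallest multiplier
    // approx_size(0, 63) == 32768     - the maximum for big == 0
    // approx_size(3, 63) == 16777216  - 2^24, the overall maximum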
> @@ -401,25 +423,64 @@ The target fragment size can set with the
> :file:`records.config` value
>  
>     ``proxy.config.cache.target_fragment_size``
>  
> -This value should be chosen so that it is a multiple of a :ref:`cache entry
> multiplier <big-mult>`. It is not necessary to make it a power of 2 [#]_.
> Larger fragments increase I/O efficiency but lead to more wasted space. The
> default size (1M, 2^20) is a reasonable choice in most circumstances altough
> in very specific cases there can be benefit from tuning this parameter. |TS|
> imposes an internal maximum of a 4194232 bytes which is 4M (2^22) less the
> size of a struct :cpp:class:`Doc`. In practice then the largest reasonable
> target fragment size is 4M - 262144 = 3932160.
> +This value should be chosen so that it is a multiple of a :ref:`cache entry
> multiplier <big-mult>`. It is not necessary
> +to make it a power of 2 [#]_. Larger fragments increase I/O efficiency but
> lead to more wasted space. The default size
> +(1M, 2^20) is a reasonable choice in most circumstances although in very
> specific cases there can be benefit from tuning
> +this parameter. |TS| imposes an internal maximum of 4194232 bytes which is
> 4M (2^22) less the size of a struct
> +:cpp:class:`Doc`. In practice then the largest reasonable target fragment
> size is 4M - 262144 = 3932160.
>  
> -When a fragment is stored to disk the size data in the cache index entry is
> set to the finest granularity permitted by the size of the fragment. To
> determine this consult the :ref:`cache entry multipler <big-mult>` table,
> find the smallest maximum size that is at least as large as the fragment.
> That will indicate the value of *big* selected and therefore the granularity
> of the approximate size. That represents the largest possible amount of
> wasted disk I/O when the fragment is read from disk.
> +When a fragment is stored to disk the size data in the cache index entry is
> set to the finest granularity permitted by
> +the size of the fragment. To determine this consult the :ref:`cache entry
> multiplier <big-mult>` table and find the smallest
> +maximum size that is at least as large as the fragment. That will indicate
> the value of *big* selected and therefore the
> +granularity of the approximate size. That represents the largest possible
> amount of wasted disk I/O when the fragment is
> +read from disk.
>  
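
In code form the selection reads roughly like this (a sketch, assuming the
fragment size is between 1 byte and the 2^24 maximum):

    #include <cstdint>

    // Find the smallest 'big' whose 64-entry range covers the fragment,
    // then the rounded-up 'size' to store with it.
    void pick_big_and_size(uint64_t bytes, unsigned& big, unsigned& size) {
      big = 0;
      while (bytes > (uint64_t(64) << (9 + 3 * big)))  // table's max column
        ++big;
      size = unsigned((bytes - 1) >> (9 + 3 * big));   // offset by one
    }
    // For the default 1M target fragment: big == 2 and size == 31, which
    // decodes to 32 * 32768 == 1048576 exactly, so nothing is wasted.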
> -.. note:: The cache index entry size is used only for reading the fragment
> from disk. The actual size on disk, and the amount of cache space consumed,
> is the actual size of the content rounded up to the disk sector size
> (default 512 bytes).
> +.. note:: The cache index entry size is used only for reading the fragment
> from disk. The actual size on disk, and the
> +amount of cache space consumed, is the actual size of the content rounded up
> to the disk sector size (default 512
> +bytes).
>  
>  .. index:: DIR_DEPTH, index segment, index buckets
>  
> -The set of index entries for a volume are grouped in to *segments*. The
> number of segments for an index is selected so that there are as few
> segments as possible such that no segment has more than 2^16 entries.
> Intra-segment references can therefore use a 16 bit value to refer to any
> other entry in the segment.
> -
> -Index entries in a segment are grouped *buckets* each of ``DIR_DEPTH``
> (currently 4) entries. These are handled in the standard hash table way,
> giving somewhat less than 2^14 buckets per segment.
> -
> -Object Metadata
> ----------------
> +The set of index entries for a volume are grouped in to *segments*. The
> number of segments for an index is selected so
> +that there are as few segments as possible such that no segment has more
> than 2^16 entries. Intra-segment references can
> +therefore use a 16 bit value to refer to any other entry in the segment.
>  
> -The metadata for an object is stored in a :cpp:class:`Doc`.
> +Index entries in a segment are grouped in to *buckets* each of ``DIR_DEPTH``
> (currently 4) entries. These are handled in the
> +standard hash table way, giving somewhat less than 2^14 buckets per segment.
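
Put together, the directory sizing falls out of the entry count; roughly
(my reading, not the actual sizing code):

    #include <cstdint>

    // Fewest segments such that no segment exceeds 2^16 entries.
    uint64_t segment_count(uint64_t total_entries) {
      return (total_entries + 65535) / 65536;
    }
    // With DIR_DEPTH = 4 entries per bucket a full segment has
    // 2^16 / 4 = 2^14 buckets, hence "somewhat less than 2^14".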
>  
>  .. [#] The comment in :file:`records.config` is simply wrong.
>  
> +.. _dir-probe:
> +
> +Directory Probing
> +-----------------
> +
> +Directory probing is locating a specific directory entry in the stripe
> directory based on a cache ID. This is handled
> +primarily by the function :cpp:func:`dir_probe()`. This is passed the cache
> ID (:arg:`key`), a stripe (:arg:`d`), and a
> +last collision (:arg:`last_collision`). The last of these is an in and out
> parameter, updated as useful during the
> +probe.
> +
> +Given an ID, the top half (64 bits) is used as a :ref:`segment
> <dir-segment>` index, taken modulo the number of segments in
> +the directory. The bottom half is used as a :ref:`bucket <dir-bucket>`
> index, taken modulo the number of buckets per
> +segment. The :arg:`last_collision` value is used to mark the last matching
> entry returned by `dir_probe`.
> +
> +After computing the appropriate bucket, the entries in that bucket are
> searched to find a match. In this case a match is
> +detected by comparison of the bottom 12 bits of the cache ID (the *cache
> tag*). The search starts at the base entry for
> +the bucket and then proceeds via the linked list of entries from that first
> entry. If a tag match is found and there is
> +no :arg:`collision` then that entry is returned and :arg:`last_collision` is
> updated to that entry. If :arg:`collision`
> +is set, then if it isn't the current match the search continues down the
> linked list, otherwise :arg:`collision` is
> +cleared and the search continues. The effect of this is that matches are
> skipped until the last returned match
> +(:arg:`last_collision`) is found, after which the next match (if any) is
> returned. If the search falls off the end of
> +the linked list then a miss result is returned (if no last collision),
> otherwise the probe is restarted after clearing
> +the collision on the presumption that the entry for the collision has been
> removed from the bucket. This can lead to
> +repeats among the returned values but guarantees that no valid entry will be
> skipped.
> +
> +Last collision can therefore be used to restart a probe at a later time.
> This is important because the match returned
> +may not be the actual object - although the hashing of the cache ID to a
> bucket and the tag matching is unlikely to
> +create false positives, that is possible. When a fragment is read the full
> cache ID is available and checked and if
> +wrong, that read can be discarded and the next possible match from the
> directory found because the cache virtual
> +connection tracks the last collision value.
> +
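
Condensed into pseudo-C++ the probe loop reads roughly as below. The
helper names are invented, and the real dir_probe presumably also
validates entries (e.g. against the write cursor) before returning them:

    #include <cstdint>

    struct Dir; struct Stripe; struct CacheId;

    // Assumed helpers standing in for the real macros:
    Dir*     bucket_root(Stripe*, CacheId const&); // segment+bucket from the ID
    Dir*     next_entry(Stripe*, Dir*);            // follow Dir::next, null at end
    uint16_t dir_tag(Dir*);                        // the stored 12 bit tag
    uint16_t key_tag(CacheId const&);              // bottom 12 bits of the ID

    Dir* dir_probe_sketch(CacheId const& key, Stripe* d, Dir** last_collision) {
      Dir* collision = *last_collision;            // last match already returned
      for (;;) {
        for (Dir* e = bucket_root(d, key); e; e = next_entry(d, e)) {
          if (dir_tag(e) != key_tag(key))
            continue;                              // not even a tag match
          if (collision) {                         // still skipping old matches
            if (collision == e)
              collision = nullptr;                 // passed it; return the next
            continue;
          }
          *last_collision = e;                     // resume point for re-probes
          return e;                                // a candidate: tag match only
        }
        if (!collision)
          return nullptr;                          // genuine miss
        collision = nullptr;                       // the last match is gone from
      }                                            // the bucket; restart
    }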
>  ----------------
>  Cache Operations
>  ----------------
> @@ -487,11 +548,11 @@ There are three basic steps to a cache lookup.
>  
>     This is normally computed using the request URL but it can be overridden
>     :ref:`by a plugin <cache-key>` . As far as I can tell the cache index
>     string is not stored anywhere, it presumed computable from the client
>     request header.
>  
> -#. The cache volume is determined (based on the cache key).
> +#. The cache stripe is determined (based on the cache key).
>  
>     The cache key is used as a hash key in to an array of :cpp:class:`Vol`
>     instances. The construction and arrangement of this array is the essence
>     of how volumes are assigned.
>  
> -#. The cache volume directory is probed using the index key computed from
> the cache key.
> +#. The cache stripe directory :ref:`is probed <dir-probe>` using the index
> key computed from the cache key.
>  
>     Various other lookaside directories are checked as well, such as the
>     :ref:`aggregation buffer <aggregation-buffer>`.
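
The three steps chain together; in terms of the probe sketch earlier in
this mail (all names hypothetical):

    struct HttpRequestHeader;                   // stand-ins, as before
    CacheId compute_cache_key(HttpRequestHeader const&);
    Stripe* stripe_for(CacheId const&);

    // Cache lookup end to end: key -> stripe -> directory probe.
    Dir* cache_lookup(HttpRequestHeader const& request) {
      CacheId key = compute_cache_key(request);  // URL unless a plugin overrides
      Stripe* stripe = stripe_for(key);          // hash into the Vol array
      Dir* last_collision = nullptr;             // fresh probe
      return dir_probe_sketch(key, stripe, &last_collision);
    }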
>  
> @@ -676,6 +737,22 @@ detected after the read completes for a fragment. If it
> is not the first fragmen
>  evacuation (which in turn, when it is read, will pull the subsequent
>  fragment). The logic doesn't seem to check the
>  length and presumes that the end of the alternate is when the next key is
>  not in the directory.
>  
> +This interacts with the "one at a time" strategy of the aggregation write
> logic. If a fragment is close to the fragment being evacuated it may end up
> in the same evacuation bucket. Because the aggregation write checks every
> time for the "next" fragment to evacuate it will find that next fragment and
> evacuate it before it is overwritten.
> +
> +.. note::
> +
> +   I do not understand the extra key list that is present in an evacuation
> block. It is labeled as needed for
> +   "collisions" but I am unclear on what might be colliding. The bucket
> entries are stored and matched by stripe offset
> +   but if two fragments collide on their offset, only one can be valid.
> Based on how :ref:`directory probing
> +   <dir-probe>` works and the logic of :cpp:func:`evacuate_fragments()` it
> appears that rather than determine which
> +   entry in a directory bucket is the correct one, all of them are marked
> for evacuation (thereby handling
> +   "collisions"). However, each one could have a distinct fragment size and
> that is set for all of the reads by the
> +   first fragment found in the directory. The intent seems to be to read all
> fragments that collide at the same starting
> +   offset and then figure out which one was really on the disk after the
> read by looking through the key list. However,
> +   this seems to presume those fragments will all be the same size, which
> seems unreasonable. I would think it would
> +   also be necessary to update the size in the :cpp:class:`Dir` instance in
> the evacuation block to be the largest size
> +   found among the collisions.


So.. essentially, this logic is probably broken then..
 

++ i
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/
GPG: 8716 7A9F 989B ABD5 100F  4008 F266 55D6 2998 1641