You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@trafficserver.apache.org by Igor Galić <i....@brainsware.org> on 2013/12/19 22:14:18 UTC

Re: git commit: Doc: Clean up in cache architecture, added information about stripe assignment initialization.


----- Original Message -----
> Updated Branches:
>   refs/heads/master 2b052f35b -> 0c953e129
> 
> 
> Doc: Clean up in cache architecture, added information about stripe
> assignment initialization.
> 
> 
> Project: http://git-wip-us.apache.org/repos/asf/trafficserver/repo
> Commit: http://git-wip-us.apache.org/repos/asf/trafficserver/commit/0c953e12
> Tree: http://git-wip-us.apache.org/repos/asf/trafficserver/tree/0c953e12
> Diff: http://git-wip-us.apache.org/repos/asf/trafficserver/diff/0c953e12
> 
> Branch: refs/heads/master
> Commit: 0c953e129225ed6db59a97d293d26ecae3f22c92
> Parents: 2b052f3
> Author: Alan M. Carroll <am...@network-geographics.com>
> Authored: Wed Dec 18 14:55:20 2013 -0600
> Committer: Alan M. Carroll <am...@network-geographics.com>
> Committed: Wed Dec 18 14:55:20 2013 -0600
> 
> ----------------------------------------------------------------------
>  doc/arch/cache/cache-arch.en.rst | 148 ++++++++++++++++++++++------------
>  doc/glossary.en.rst              |  42 +++++++++-
>  2 files changed, 139 insertions(+), 51 deletions(-)
> ----------------------------------------------------------------------
> 
> 
> http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/arch/cache/cache-arch.en.rst
> ----------------------------------------------------------------------
> diff --git a/doc/arch/cache/cache-arch.en.rst
> b/doc/arch/cache/cache-arch.en.rst
> index 0018c3a..44b5a75 100644
> --- a/doc/arch/cache/cache-arch.en.rst
> +++ b/doc/arch/cache/cache-arch.en.rst
> @@ -25,9 +25,8 @@ Introduction
>  
>  In addition to an HTTP proxy, |ATS| is also an HTTP cache. |TS| can cache
>  any octet stream although it currently
>  supports only those octet streams delivered by the HTTP protocol. When such
>  a stream is cached (along with the HTTP
> -protocol headers) it is termed an *object* in the cache. Each object is
> identified by a globally unique value called a
> -*cache key*. By default this is generated by taking the `MD5 hash
> <http://www.openssl.org/docs/crypto/md5.html>`_ of the
> -URI used to retrieve the content from the origin server.
> +protocol headers) it is termed an :term:`object <cache object>` in the
> cache. Each object is identified by a globally
> +unique value called a :term:`cache key`.
>  
>  The purpose of this document is to describe the basic structure and
>  implementation details of the |TS| cache.
>  Configuration of the cache will be discussed only to the extent needed to
>  understand the internal mechanisms. This
> @@ -41,20 +40,28 @@ different ways than they are used in the code in an
> attempt to create some consi
>  Cache Layout
>  ~~~~~~~~~~~~
>  
> -The first step in understanding cache operations is to understand the data
> structures and layout of the cache.
> +The following sections describe how persistent cache data is structured.
> |TS| treats its persisent storage an
> +undifferentiated collection of bytes, assuming no other structure to it. In
> particular it does not use the file system
> +of the host operating system. If a file is used it is used only to mark out
> the set of bytes to be used.
>  
>  Cache storage
>  =============
>  
> -The raw storage for the |TS| cache is configured in :file:`storage.config`.
> Each line in the file defines a :term:`cache span` which is treated as an
> undifferentiated unit of storage.
> +The raw storage for the |TS| cache is configured in :file:`storage.config`.
> Each line in the file defines a :term:`cache
> +span` which is treated as a uniform persistent store.
>  
>  .. figure:: images/cache-spans.png
>     :align: center
>  
>     Two cache spans
>  
> -This storage organized in to a set of :term:`cache volume`\ s which are
> defined in :file:`volume.config`. Cache volumes
> -can be defined by a percentage of the total storage or an absolute amount of
> storage. By default each cache volume is spread across all of the cache
> spans for robustness. The intersection of a cache volume and a cache span is
> a :term:`cache stripe`. Each cache span is divided in to cache stripes and
> each cache volume is a collection of those stripes.
> +This storage organized in to a set of :term:`cache volume`\ s which are

in to, not into?

> defined in :file:`volume.config` for the
> +purposes of the administrator. These are the units that used for all other
> administator level configuration.
> +
> +Cache volumes can be defined by a percentage of the total storage or an
> absolute amount of storage. By default each
> +cache volume is spread across all of the cache spans for robustness. The
> intersection of a cache volume and a cache span
> +is a :term:`cache stripe`. Each cache span is divided in to cache stripes
> and each cache volume is a collection of those
> +stripes.
>  
>  If the cache volumes for the example cache spans were defined as
>  
> @@ -66,12 +73,12 @@ then the actual layout would look like
>  .. image:: images/cache-span-layout.png
>     :align: center
>  
> -A cached object is stored entirely in a single stripe, and therefore in a
> single cache span - objects are never split
> -across cache volumes. Objects are assigned to a stripe (and hence to a cache
> volume) automatically based on a hash of
> -the URI used to retrieve the object from the origin server. It is possible
> to configure this to a limited extent in
> -:file:`hosting.config` which supports content from specific host or domain
> to be stored on specific volumes. In
> -addition, as of version 4.0.1 it is possible to control which cache spans
> (and hence, which cache stripes) are contained

what is the benefit of that?

> -in a specific cache volume.
> +Cache stripes are the fundamental unit of cache for the implementation. A
> cached object is stored entirely in a single
> +stripe, and therefore in a single cache span - objects are never split
> across cache spans or volumes. Objects are
> +assigned to a stripe (and hence to a cache volume) automatically based on a
> hash of the URI used to retrieve the object
> +from the origin server. It is possible to configure this to a limited extent
> in :file:`hosting.config` which supports
> +content from specific host or domain to be stored on specific cache volumes.
> In addition, as of version 4.0.1 it is
> +possible to control which cache spans (and hence, which cache stripes) are
> contained in a specific cache volume.
>  
>  The layout and structure of the cache spans, the cache volumes, and the
>  cache stripes that compose them are derived
>  entirely from the :file:`storage.config` and :file:`cache.config` and is
>  recomputed from scratch when the
> @@ -108,9 +115,9 @@ Content in a stripe is tracked via a directory. We call
> each element of the dire
>  represented by :cpp:class:`Dir`. Each entry refers to a chunk of contiguous
>  storage in the cache. These are referred to
>  variously as "fragments", "segments", "docs" / "documents", and a few other
>  things. This document will use the term
>  "fragment" as that is the most common reference in the code. The term "Doc"
>  (for :cpp:class:`Doc`) will be used to refer
> -to the header data for a fragment. Overall the directory is treated as a
> hash with a "cache ID" as the key. A cache ID
> -is a 128 bit (16 byte) value generated in various ways depending on context.
> This ID is reduced and used as an index in
> -to the directory to locate an entry in the standard way.
> +to the header data for a fragment. Overall the directory is treated as a
> hash with the :term:`cache ID` as the key. See
> +:ref:`directory probing <cache-directory-probe>` for how the cache ID is
> used to locate a directory entry. The cache ID
> +is in turn computed from a :term:`cache key` which by default is the URL of
> the content.
>  
>  The directory is used as a memory resident structure which means a directory
>  entry is as small as possible (currently 10
>  bytes). This forces some compromises on the data that can be stored there.
>  On the other hand this means that most cache
> @@ -126,44 +133,61 @@ there is enough memory to run |TS| with an empty cache
> there is enough to run it
>     :align: center
>  
>  Each entry stores an offset in the stripe and a size. The size stored in the
>  directory entry is an :ref:`approximate
> -size <dir-size>` which is at least as big as the actual data. Exact size
> data is stored in the fragment header on disk.
> +size <dir-size>` which is at least as big as the actual data in the
> fragment. Exact size data is stored in the fragment
> +header on disk.
>  
>  .. note::
>  
> -   Data in HTTP headers cannot be examined without disk I/O. This includes
> the original URL for the object. The original
> -   source of the cache ID is not stored anywhere.
> +   Data in HTTP headers cannot be examined without disk I/O. This includes
> the original URL for the object. The cache
> +   key is not stored explicitly and therefore cannot be reliably retrieved.

Huh? What does that mean for all those header_* plugins?

> +The directory is a hash table that uses `chaining
> +<http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables#Collision_resolution>`_
> for collision resolution. Because each
> +entry is small they are used directly as the list header of the hash bucket.
>  
>  .. _dir-segment:
>  .. _dir-bucket:
>  
> -The entries in a directory are grouped. The first level grouping is a
> *bucket*. This is a fixed number (currently 4 -
> -defined as ``DIR_DEPTH``) of entries. The index generated from a cache ID is
> used as a bucket index (not an entry index).
> -Buckets are grouped in to *segments*. All segments in a stripe have the same
> number of buckets. The number of segments
> -in a stripe is chosen so that each segment has as many buckets as possible
> without exceeding 65535 (2\ :sup:`16`\ -1)
> -entries in a segment. Note that all segments in the same stripe will have
> the same number of buckets.
> +Chaining is implemented by imposing grouping structures on the entries in a
> directory. The first level grouping is a
> +:term:`directory bucket`. This is a fixed number (currently 4 - defined as
> ``DIR_DEPTH``) of entries. This
> +serves to define the basic hash buckets with the first entry in each cache
> bucket serving as the root of the hash
> +bucket.
> +
> +.. note::
> +
> +   The term "bucket" is used in the code to mean both the conceptual bucket
> for hashing and for a structural grouping
> +   mechanism in the directory and so these will be qualified as needed to
> distinguish them. The unqualified term
> +   "bucket" is almost always used to mean the structural grouping in the
> directory.
> +
> +Directory buckets are grouped in to :term:`segments <directory segment>`.
> All segments in a stripe have the same number of
> +buckets. The number of segments in a stripe is chosen so that each segment
> has as many buckets as possible without
> +exceeding 65535 (2\ :sup:`16`\ -1) entries in a segment.
>  
>  .. figure:: images/dir-segment-bucket.png
>     :align: center
>  
> -Each entry has a previous and next index value which is used to link the
> entries in the same segment. The index size is
> -16 bits which suffices to index any entry in the same segment. The stripe
> header contains an array of entry indices
> -which are used as the roots of a free list. When a stripe is initialized the
> first entry in each bucket is zeroed
> -(marked unused) and all other entries are put in the corresponding segment
> free list rooted in the stripe header. In
> -essence the first element of each fixed bucket is used as a root for that
> bucket. The other entries in the fixed bucker
> -are preferentially preferred for adding to that bucket but this is not
> required. The segment free lists are initialized
> -such that the extra bucket entries are added in order - all the seconds,
> then the thirds, then the fourths. Because the
> -free lists are FIFOs this means extra entries will be selected from the
> fourth entries across all the buckets first,
> -then the thirds, etc. This maximizes locality for bucket searching.
> +Each directory entry has a previous and next index value which is used to
> link entries in the same segment. Because no
> +segment has more than 65535 entries 16 bits suffices for storing the index
> values. The stripe header contains an array
> +of entry indices which are used as the roots of entry free lists, one for
> each segment. Active entries are stored via
> +the bucket structure. When a stripe is initialized the first entry in each
> bucket is zeroed (marked unused) and all
> +other entries are put in the corresponding segment free list in the stripe
> header. This means the first entry of each
> +directory bucket is used as the root of a hash bucket and is therefore
> marked unused rather than being put a free list.

rather than being put *in* OR *on* a free list.

> +The other entries in the directory bucket are preferentially preferred for

"preferentially preferred" still makes me stumble.. is this intended?

> adding to the corresponding hash bucket but
> +this is not required. The segment free lists are initialized such that the
> extra bucket entries are added in order - all
> +the seconds, then the thirds, then the fourths. Because the free lists are
> FIFOs this means extra entries will be
> +selected from the fourth entries across all the buckets first, then the
> thirds, etc. When allocating a new directory
> +entry in a bucket the entries are searched from first to last, which
> maximizes bucket locality (that is, cache IDs that
> +map to the same hash bucket will also tend to use the same directory
> bucket).
>  
>  .. figure:: images/dir-bucket-assign.png
>     :align: center
>  
>  Entries are removed from the free list when used and returned when no longer
>  in use. When a fragment needs to be put in
> -to the directory the cache ID is used to locate a bucket (which also
> determines the segment). If the first entry in the
> -bucket is marked unused, it is used. If not then the other entries in the
> bucket are searched and if any are on the free
> -list, that entry is used. If none are available then the first entry on the
> segment free list is used. This entry is
> -attached to the bucket via the same next and previous indices used for the
> free list so that it can be found when doing
> -a lookup of a cache ID.
> +to the directory the cache ID is used to locate a hash bucket (which also
> determines the segment and directory bucket).
> +If the first entry in the directory bucket is marked unused, it is used. If
> not then the other entries in the bucket are
> +searched and if any are on the free list, that entry is used. If none are
> available then the first entry on the segment
> +free list is used. This entry is attached to the hash bucket via the same
> next and previous indices used for the free
> +list so that it can be found when doing a lookup of a cache ID.
>  
>  Storage Layout
>  --------------
> @@ -435,10 +459,6 @@ maximum size that is at least as large as the fragment.
> That will indicate the v
>  granularity of the approximate size. That represents the largest possible
>  amount of wasted disk I/O when the fragment is
>  read from disk.
>  
> -.. note:: The cache index entry size is used only for reading the fragment
> from disk. The actual size on disk, and the
> -amount of cache space consumed, is the actual size of the content rounded up
> to the disk sector size (default 512
> -bytes).
> -
>  .. index:: DIR_DEPTH, index segment, index buckets
>  
>  The set of index entries for a volume are grouped in to *segments*. The
>  number of segments for an index is selected so
> @@ -450,7 +470,7 @@ standard hash table way, giving somewhat less than 2^14
> buckets per segment.
>  
>  .. [#] The comment in :file:`records.config` is simply wrong.
>  
> -.. _dir-probe:
> +.. _cache-directory-probe:
>  
>  Directory Probing
>  -----------------
> @@ -552,7 +572,7 @@ There are three basic steps to a cache lookup.
>  
>     The cache key is used as a hash key in to an array of :cpp:class:`Vol`
>     instances. The construction and arrangement of this array is the essence
>     of how volumes are assigned.
>  
> -#. The cache stripe directory :ref:`is probed <dir-probe>` using the index
> key computed from the cache key.
> +#. The cache stripe directory :ref:`is probed <cache-directory-probe>` using
> the index key computed from the cache key.
>  
>     Various other lookaside directories are checked as well, such as the
>     :ref:`aggregation buffer <aggregation-buffer>`.
>  
> @@ -744,7 +764,7 @@ This interacts with the "one at a time" strategy of the
> aggregation write logic.
>     I do not understand the extra key list that is present in an evacuation
>     block. It is labeled as needed for
>     "collisions" but I am unclear on what might be colliding. The bucket
>     entries are stored and matched by stripe offset
>     but if two fragments collide on their offset, only one can be valid.
>     Based on how :ref:`directory probing
> -   <dir-probe>` works and the logic of :cpp:func:`evacuate_fragments()` it
> appears that rather than determine which
> +   <cache-directory-probe>` works and the logic of
> :cpp:func:`evacuate_fragments()` it appears that rather than determine which
>     entry in a directory bucket is the correct one, all of them are marked
>     for evacuation (thereby handling
>     "collisions"). However, each one could have a distinct fragment size and
>     that is set for all of the reads by the
>     first fragment found in the directory. The intent seems to be to read all
>     fragments that collide at the same starting
> @@ -774,17 +794,45 @@ basically four types,
>  * Disk
>  * Raw device
>  
> -After setting all the `Span` instances they are grouped by device id to
> internal linked lists attached to the
> +After creating all the `Span` instances they are grouped by device id to
> internal linked lists attached to the
>  :cpp:member:`Store::disk` array [#]_. Spans that refer to the same
>  directory, disk, or raw device are coalesced in to a
> -single span. Spans that refer to the same file with overlapping offsets are
> also coalesced [#]_. This is all done in :c:func:`ink_cache_init()` called
> during startup.
> +single span. Spans that refer to the same file with overlapping offsets are
> also coalesced [#]_. This is all done in
> +:c:func:`ink_cache_init()` called during startup.
> +
> +.. note:: The span logic is also used by the HostDB and more than one
> otherwise inexplicable feature is provided by the span logic for that
> module.
>  
>  After configuration initialization the cache processor is started by calling
>  :ccp:func:`CacheProcessor::start()`. This
>  does a number of things.
>  
>  For each valid span, an instance of :cpp:class:`CacheDisk` is created. This
>  class is a continuation and so can be used
> -to perform potentially blocking operations on the span. This what is passed
> to the AIO threads to be called when an I/O
> -operation completes. These are then dispatched to AIO threads to perform
> storage unit initialization. After all of those
> -have completed, the resulting storage is distributed across the volumes in
> :c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are
> created at this time.
> +to perform potentially blocking operations on the span. The primary use of
> these is to be passed to the AIO threads as
> +the callback when an I/O operation completes. These are then dispatched to
> AIO threads to perform storage unit
> +initialization. After all of those have completed, the resulting storage is
> distributed across the volumes in
> +:c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are
> created at this time.
> +
> +Cache stripe assignment setup is done once all stripes have initialized
> (that is, the stripe header information has been
> +successfully read from disk for all stripes). The assignment information is
> stored as an array of indices. These are
> +indices in to an array of stripes. Both the assignment and the stripe arrays
> are stored in an instance of
> +:cpp:class:`CacheHostRecord`. Assignment initialization consists of
> populating the assignment array which is much larger
> +than the stripe array.
> +
> +There is an instance of :cpp:class:`CacheHostRecord` for each line in
> :file:`hosting.config` and one "generic" record.
> +For the configured instances the set of stripes is determined from the cache
> volume specified in the line. If no lines are specified all stripes are
> placed in the generic record, otherwise only those stripes marked as default
> are placed in the generic record.

Line break at max 120, please.

> +
> +.. note:: If hosting records are specified it is an error to not specify at
> least one default cache volume.
> +
> +The assignment table is initialized in :c:func:`build_vol_hash_table` which
> is called for each
> +:cpp:class:`CacheHostRecord` instance. For each strip in the host record a

is this supposed to be "strip"?

> sequence of pseudo-random numbers is
> +generated, starting with the folded hash of the stripe hash identifier,
> which is the device path followed by the skip
> +and size values for that stripe, making it unique. This also makes the
> sequence deterministic for any particular stripe.
> +Each stripe gets one number in its sequence for every `VOL_HASH_ALLOC_SIZE`
> (8 MB currently) of storage. These numbers are paired with
> +the stripe index, combined across all stripes, then sorted by the random
> values. The resulting array is sampled for
> +every slot in the stripe assignment table by dividing the maximum random
> value by the size of the assignment table and
> +using the value midway between each multiple of the result of the division.
> The coalesced psuedo-random sequence is
> +scanned for each sample in turn and the first number not greater than the
> sample is found. The stripe associated with
> +that value is used for that assignment table entry.
> +
> +While this procedure is determinstic it is sensitive to initial conditions,
> including the size of each stripe.
>  
>  .. rubric:: Footnotes
>  
> 
> http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/glossary.en.rst
> ----------------------------------------------------------------------
> diff --git a/doc/glossary.en.rst b/doc/glossary.en.rst
> index e103fee..94996db 100644
> --- a/doc/glossary.en.rst
> +++ b/doc/glossary.en.rst
> @@ -44,11 +44,34 @@ Glossary
>  
>     cache stripe
>        A homogenous persistent store for the cache in a single :term:`cache
>        span`. A stripe always resides
> -      entirely on a single physical device and is treated as an
> undifferentiated span of bytes.
> +      entirely on a single physical device and is treated as an
> undifferentiated span of bytes. This is the smallest
> +      independent unit of storage.
>  
>     cache span
>        The physical storage described by a single line in
>        :file:`storage.config`.
>  
> +   cache key
> +      A byte sequence that is a globally unique identifier for an
> :term:`object <cache object>` in the cache. By default
> +      the URL for the object is used.
> +
> +   cache ID
> +      A 128 bit value used as a fixed sized identifier for an object in the
> cache. This is computed from the
> +      :term:`cache key` using the `MD5 hashing function
> <http://www.openssl.org/docs/crypto/md5.html>`_.

Lies :P

> +
> +   cache tag
> +      The bottom few bits (12 currently) of the :term:`cache ID`. This is
> used in the :ref:`cache directory <cache-directory>`
> +      for a preliminary identity check before going to disk.
> +
> +   cache object
> +      The minimal self contained unit of data in the cache. Cache objects
> are the stored version of equivalent content
> +      streams from an origin server. A single object can have multiple
> variants called :term:`alternates <alternate>`.
> +
> +   alternate
> +      A variant of a :term:`cache object`. This was originally created to
> handle the `VARY mechanism
> +      <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.44>`_

Should we reference HTTP-bis here?

> but has since been used for additional
> +      purposes. All alternates of an object must be equivalent in some
> manner, that is they are alternate forms of the
> +      same stream. The most common example is having normal and compressed
> versions of the stream.
> +
>     storage unit
>        Obsolete term for :term:`cache span`.
>  
> @@ -59,3 +82,20 @@ Glossary
>  
>     write cursor
>        The location in a :term:`cache stripe` where new data is written.
> +
> +   directory segment
> +      A contiguous group of :term:`buckets <directory bucket>`. Each
> :term:`cache stripe` has a set of segments all of
> +      which have the same number of buckets, although the number of buckets
> per segment can vary between cache stripes.
> +      Segments are administrative in purpose to minimize the size of free
> list and hash bucket pointers.
> +
> +   directory bucket
> +      A contiguous fixed sized group of :term:`directory entries <directory
> entry>`. This is used for hash bucket
> +      maintenance optimization.
> +
> +   directory entry
> +      An in memory entry that describes a :term:`cache fragment`.
> +
> +   cache fragment
> +      The unit of storage in the cache. All reads from the cache always read
> exactly one fragment. Fragments may be
> +      written in groups, but every write is always an integral number of
> fragments. Each fragment has a corresponding
> +      :term:`directory entry` which describes its location in the cache
> storage.
> 
> 

-- 
Igor Galić

Tel: +43 (0) 664 886 22 883
Mail: i.galic@brainsware.org
URL: http://brainsware.org/
GPG: 8716 7A9F 989B ABD5 100F  4008 F266 55D6 2998 1641