You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@trafficserver.apache.org by jp...@apache.org on 2015/11/03 07:09:57 UTC

[21/51] trafficserver git commit: Documentation reorganization

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/architecture.en.rst
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/architecture.en.rst b/doc/developer-guide/architecture/architecture.en.rst
new file mode 100644
index 0000000..e7f8dcf
--- /dev/null
+++ b/doc/developer-guide/architecture/architecture.en.rst
@@ -0,0 +1,1158 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements.  See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership.  The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied.  See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+.. include:: ../../common.defs
+
+.. _developer-cache-architecture:
+
+Cache Architecture
+******************
+
+Introduction
+~~~~~~~~~~~~
+
+In addition to being an HTTP proxy, |ATS| is also an HTTP cache. |TS| can cache
+any octet stream, although it currently supports only those octet streams
+delivered by the HTTP protocol. When such a stream is cached (along with the
+HTTP protocol headers) it is termed an :term:`object <cache object>` in the
+cache. Each object is identified by a globally unique value called a
+:term:`cache key`.
+
+The purpose of this document is to describe the basic structure and
+implementation details of the |TS| cache. Configuration of the cache will be
+discussed only to the extent needed to understand the internal mechanisms. This
+document will be useful primarily to |TS| developers working on the |TS|
+codebase or plugins for |TS|. It is assumed the reader is already familiar with
+the :ref:`admin-guide` and specifically with :ref:`http-proxy-caching` and
+:ref:`configuring-the-cache` along with the associated configuration files and
+values.
+
+Unfortunately, the internal terminology is not particularly consistent, so this
+document will frequently use terms in different ways than they are used in the
+code in an attempt to create some consistency.
+
+Cache Layout
+~~~~~~~~~~~~
+
+The following sections describe how persistent cache data is structured. |TS|
+treats its persisent storage as an undifferentiated collection of bytes,
+assuming no other structure to it. In particular, it does not use the file
+system of the host operating system. If a file is used it is used only to mark
+out the set of bytes to be used.
+
+Cache storage
+=============
+
+The raw storage for the |TS| cache is configured in :file:`storage.config`. Each
+line in the file defines a :term:`cache span` which is treated as a uniform
+persistent store.
+
+.. figure:: images/cache-spans.png
+   :align: center
+
+   Two cache spans
+
+This storage organized into a set of :term:`cache volumes <cache volume>` which
+are defined in :file:`volume.config`. These are the units that are used for all
+other administator level configuration.
+
+Cache volumes can be defined by a percentage of the total storage or as an
+absolute amount of storage. By default, each cache volume is spread across all
+of the cache spans for robustness. The intersection of a cache volume and a
+cache span is a :term:`cache stripe`. Each cache span is divided into cache
+stripes and each cache volume is a collection of those stripes.
+
+If the cache volumes for the example cache spans were defined as:
+
+.. image:: images/ats-cache-volume-definition.png
+   :align: center
+
+Then the actual layout would look like:
+
+.. image:: images/cache-span-layout.png
+   :align: center
+
+Cache stripes are the fundamental unit of cache for the implementation. A
+cached object is stored entirely in a single stripe, and therefore in a single
+cache span. Objects are never split across cache spans or volumes. Objects are
+assigned to a stripe (and in turn to a cache volume) automatically based on a
+hash of the URI used to retrieve the object from the :term:`origin server`. It
+is possible to configure this to a limited extent in :file:`hosting.config`,
+which supports content from specific hosts or domain to be stored on specific
+cache volumes. As of version 4.0.1 it is also possible to control which cache
+spans (and hence, which cache stripes) are contained in a specific cache volume.
+
+The layout and structure of the cache spans, the cache volumes, and the cache
+stripes that compose them are derived entirely from :file:`storage.config` and
+:file:`cache.config` and is recomputed from scratch when the
+:program:`traffic_server` is started. Therefore, any change to those files can
+(and almost always will) invalidate the existing cache in its entirety.
+
+Stripe Structure
+================
+
+|TS| treats the storage associated with a cache stripe as an undifferentiated
+span of bytes. Internally each stripe is treated almost entirely independently.
+The data structures described in this section are duplicated for each stripe.
+Internally the term *volume* is used for these stripes and implemented primarily
+in :cpp:class:`Vol`. What a user thinks of as a volume (and what this document
+calls a *cache volume*) is represented by :cpp:class:`CacheVol`.
+
+.. note::
+
+   Stripe assignment must be done before working with an object because the
+   directory is local to the stripe. Any cached objects for which the stripe
+   assignment is changed are effectively lost as their directory data will not
+   be found in the new stripe.
+
+.. index:: cache directory
+.. _cache-directory:
+
+Cache Directory
+---------------
+
+.. index:: directory entry
+.. index:: fragment
+.. index:: cache ID
+
+.. _fragment:
+
+Content in a stripe is tracked via a directory. Each element of the directory
+is a :term:`directory entry` and is represented by :cpp:class:`Dir`. Each entry
+refers to a chunk of contiguous storage in the cache. These are referred to
+variously as *fragments*, *segments*, *docs*, *documents*, and a few other
+things. This document will use the term *fragment* as that is the most common
+reference in the code. The term *Doc* (for :cpp:class:`Doc`) will be used to
+refer to the header data for a fragment. Overall, the directory is treated as a
+hash with the :term:`cache ID` as the key. See
+:ref:`directory probing <cache-directory-probe>` for how the cache ID is used
+to locate a directory entry. The cache ID is in turn computed from a
+:term:`cache key` which by default is the URL of the content.
+
+The directory is used as a memory resident structure, which means a directory
+entry is as small as possible (currently 10 bytes). This forces some
+compromises on the data that can be stored there. On the other hand this means
+that most cache misses do not require disk I/O, which has a large performance
+benefit.
+
+The directory is always fully sized; once a stripe is initialized the directory
+size is fixed and never changed. This size is related (roughly linearly) to the
+size of the stripe. It is for this reason the memory footprint of |TS| depends
+strongly on the size of the disk cache. Because the directory size does not
+change, neither does this memory requirement, so |TS| does not consume more
+memory as more content is stored in the cache. If there is enough memory to run
+|TS| with an empty cache there is enough to run it with a full cache.
+
+.. figure:: images/cache-directory-structure.png
+   :align: center
+
+Each entry stores an offset in the stripe and a size. The size stored in the
+directory entry is an :ref:`approximate size <dir-size>` which is at least as
+big as the actual data in the fragment. Exact size data is stored in the
+fragment header on disk.
+
+.. note::
+
+   Data in HTTP headers cannot be examined without disk I/O. This includes the
+   original URL for the object. The cache key is not stored explicitly and
+   therefore cannot be reliably retrieved.
+
+The directory is a hash table that uses `chaining
+<http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables#Collision_resolution>`_
+for collision resolution. Because each entry is small they are used directly as
+the list header of the hash bucket.
+
+.. _dir-segment:
+.. _dir-bucket:
+
+Chaining is implemented by imposing grouping structures on the entries in a
+directory. The first level grouping is a :term:`directory bucket`. This is a
+fixed number (currently 4, defined as ``DIR_DEPTH``) of entries. This serves to
+define the basic hash buckets with the first entry in each cache bucket serving
+as the root of the hash bucket.
+
+.. note::
+
+   The term *bucket* is used in the code to mean both the conceptual bucket for
+   hashing and for a structural grouping mechanism in the directory. These will
+   be qualified as needed to distinguish them. The unqualified term *bucket* is
+   almost always used to mean the structural grouping in the directory.
+
+Directory buckets are grouped in to :term:`segments <directory segment>`. All
+segments in a stripe have the same number of buckets. The number of segments in
+a stripe is chosen so that each segment has as many buckets as possible without
+exceeding 65,535 (2\ :sup:`16`\ -1) entries in a segment.
+
+.. figure:: images/dir-segment-bucket.png
+   :align: center
+
+Each directory entry has a previous and next index value which is used to link
+entries in the same segment. Because no segment has more than 65,535 entries,
+16 bits suffices for storing the index values. The stripe header contains an
+array of entry indices which are used as the roots of entry free lists, one for
+each segment. Active entries are stored via the bucket structure. When a stripe
+is initialized the first entry in each bucket is zeroed (marked unused) and all
+other entries are put in the corresponding segment free list in the stripe
+header. This means the first entry of each :term:`directory bucket` is used as
+the root of a hash bucket and is therefore marked unused rather than being put
+a free list. The other entries in the directory bucket are preferred for adding
+to the corresponding hash bucket but this is not required. The segment free
+lists are initialized such that the extra bucket entries are added in order;
+all the seconds, then the thirds, then the fourths. Because the free lists are
+FIFOs, this means extra entries will be selected from the fourth entries across
+all the buckets first, then the thirds, etc. When allocating a new directory
+entry in a bucket the entries are searched from first to last, which maximizes
+bucket locality (that is, :term:`cache IDs <cache ID>` that map to the same
+hash bucket will also tend to use the same directory bucket).
+
+.. figure:: images/dir-bucket-assign.png
+   :align: center
+
+Entries are removed from the free list when used and returned when no longer in
+use. When a :term:`fragment <cache fragment>` needs to be put in to the
+directory the cache ID is used to locate a hash bucket (which also determines
+the segment and directory bucket). If the first entry in the directory bucket
+is marked unused, it is used. Otherwise, the other entries in the bucket are
+searched and if any are on the free list, that entry is used. If none are
+available then the first entry on the segment free list is used. This entry is
+attached to the hash bucket via the same next and previous indices used for the
+free list so that it can be found when doing a lookup of a cache ID.
+
+Storage Layout
+--------------
+
+The storage layout is the stripe metadata followed by cached content. The
+metadata consists of three parts: the stripe header, the directory, and the
+stripe footer. The metadata is stored twice. The header and the footer are
+instances of :cpp:class:`VolHeaderFooter`. This is a stub structure which can
+have a trailing variable sized array. This array is used as the segment free
+list roots in the directory. Each contains the segment index of the first
+element of the free list for the segment. The footer is a copy of the header
+without the segment free lists. This makes the size of the header dependent on
+the directory but not that of the footer.
+
+.. figure:: images/cache-stripe-layout.png
+   :align: center
+
+Each stripe has several values that describe its basic layout:
+
+skip
+   The start of stripe data. This represents either space reserved at the start
+   of a physical device to avoid problems with the host operating system, or an
+   offset representing use of space in the cache span by other stripes.
+
+start
+   The offset for the start of the content, after the stripe metadata.
+
+length
+   Total number of bytes in the stripe. :cpp:member:`Vol::len`.
+
+data length
+   Total number of blocks in the stripe available for content storage.
+   :cpp:member:`Vol::data_blocks`.
+
+.. note::
+
+   Great care must be taken with sizes and lengths in the cache code because
+   there are at least three different metrics (bytes, cache blocks, store
+   blocks) used in various places.
+
+The total size of the directory (the number of :term:`entries <directory entry>`)
+is computed by taking the size of the :term:`cache stripe` and dividing by the
+average object size. The directory always consumes this amount of memory which
+has the effect that if cache size is increased so is the memory requirement for
+|TS|. The average object size defaults to 8000 bytes but can be configured using
+:ts:cv:`proxy.config.cache.min_average_object_size`. Increasing the average
+object size will reduce the memory footprint of the directory at the expense of
+reducing the number of distinct objects that can be stored in the cache.
+
+.. index: write cursor
+.. _write-cursor:
+
+The content area stores the actual objects and is used as a circular buffer
+where new objects overwrite the least recently cached objects. The location in
+a stripe where new cache data is written is called the *write cursor*. This
+means that objects can be de facto evicted from cache even if they have not
+expired if the data is overwritten by the write cursor. If an object is
+overwritten this is not detected at that time and the directory is not updated.
+Instead it will be noted if the object is accessed in the future and the disk
+read of the fragment fails.
+
+.. figure:: images/ats-cache-write-cursor.png
+   :align: center
+
+   The write cursor and documents in the cache.
+
+.. note:: Cache data on disk is never updated.
+
+This is a key thing to keep in mind. What appear to be updates (such as doing a
+refresh on :term:`stale` content and getting back a 304) are actually new
+copies of data being written at the write cursor. The originals are left as
+"dead" space which will be consumed when the write cursor arrives at that disk
+location. Once the stripe directory is updated (in memory) the original
+fragment in the cache is effectively destroyed. This is the general space
+management technique used in other cases as well. If an object needs to removed
+from cache, only the directory needs to be changed. No other work (and
+particularly, no disk I/O) needs to be done.
+
+Object Structure
+================
+
+Objects are stored as two types of data: metadata and content data. Metadata is
+all the data about the object and the content and includes the HTTP headers.
+The content data is the content of the object, the octet stream delivered to
+the client as the object.
+
+Objects are rooted in a :cpp:class:`Doc` structure stored in the cache.
+:cpp:class:`Doc` serves as the header data for a :term:`cache fragment` and is
+contained at the start of every fragment. The first fragment for an object is
+termed the *first Doc* and always contains the object metadata. Any
+operation on the object will read this fragment first. The fragment is located
+by converting the :term:`cache key` for the object to a :term:`cache ID` and
+then doing a lookup for a :term:`directory entry` with that key. The directory
+entry has the offset and approximate size of the first fragment which is then
+read from the disk. This fragment will contain the request header and response
+along with overall object properties (such as content length).
+
+.. index:: alternate
+
+|TS| supports `varying content <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.44>`_
+for objects. These are called :term:`alternates <alternate>`. All metadata for
+all alternates is stored in the first fragment including the set of alternates
+and the HTTP headers for them. This enables `alternate selection
+<http://trafficserver.apache.org/docs/trunk/sdk/http-hooks-and-transactions/http-alternate-selection.en.html>`_
+to be done after the *first Doc* is read from disk. An object that has more than
+one alternate will have the alternate content stored separately from the first
+fragment. For objects with only one alternate the content may or may not be in
+the same (first) fragment as the metadata. Each separate alternate content is
+allocated a directory entry and the key for that entry is stored in the first
+fragment metadata.
+
+Prior to version 4.0.1, the header data was stored in the
+:cpp:class:`CacheHTTPInfoVector` class which was marshaled to a variable length
+area of the on disk image, followed by information about additional fragments
+if needed to store the object.
+
+.. figure:: images/cache-doc-layout-3-2-0.png
+   :align: center
+
+   ``Doc`` layout 3.2.0
+
+This had the problem that with only one fragment table it could not be reliable
+for objects with more than one alternate[#multiple-alternates]_. Therefore, the
+fragment data was moved from being a separate variable length section of the
+metadata to being directly incorporated in to the :cpp:class:`CacheHTTPInfoVector`,
+yielding a layout of the following form.
+
+.. figure:: images/cache-doc-layout-4-0-1.png
+   :align: center
+
+   ``Doc`` layout 4.0.1
+
+Each element in the vector contains for each alternate, in addition to the HTTP
+headers and the fragment table (if any), a :term:`cache key`. This cache key
+identifies a :term:`directory entry` that is referred to as the *earliest Doc*.
+This is the location where the content for the alternate begins.
+
+When the object is first cached, it will have a single alternate and that will
+be stored (if not too large) in first ``Doc``. This is termed a *resident alternate*
+in the code. This can only happen on the initial store of the object. If the
+metadata is updated (such as a ``304`` response to an ``If-Modified-Since``
+request) then unless the object is small, the object data will be left in the
+original fragment and a new fragment written as the first fragment, making the
+alternate non-resident. *Small* is defined as a length smaller than
+:ts:cv:`proxy.config.cache.alt_rewrite_max_size`.
+
+.. note::
+
+   The :cpp:class:`CacheHTTPInfoVector` is stored only in the first ``Doc``.
+   Subsequent ``Doc`` instances for the object, including the earliest ``Doc``,
+   should have an ``hlen`` of zero and if not, it is ignored.
+
+Large objects are split in to multiple fragments when written to the cache. This
+is indicated by a total document length that is longer than the content in
+first ``Doc`` or an earliest ``Doc``. In such a case a fragment offset table is
+stored. This contains the byte offset in the object content of the first byte
+of content data for each fragment past the first (as the offset for the first
+is always zero). This allows range requests to be serviced much more
+efficiently for large objects, as intermediate fragments that do not contain
+data in the range can be skipped. The last fragment in the sequence is detected
+by the fragment size and offset reaching the end of the total size of the
+object, there is no explicit end mark. Each fragment is computationally chained
+from the previous in that the cache key for fragment N is computed by::
+
+   key_for_N_plus_one = next_key(key_for_N);
+
+Where ``next_key`` is a global function that deterministically computes a new
+cache key from an existing cache key.
+
+Objects with multiple fragments are laid out such that the data fragments
+(including the earliest ``Doc``) are written first and the first ``Doc`` is
+written last. When read from disk, both the first and earliest ``Doc`` are
+validated (tested to ensure that they haven't been overwritten by the write
+cursor) to verify that the entire document is present on disk (as they bookend
+the other fragments - the write cursor cannot overwrite them without overwriting
+at least one of the verified ``Doc`` instances). Note that while the fragments
+of a single object are ordered they are not necessarily contiguous, as data from
+different objects are interleaved as the data arrives in |TS|.
+
+.. figure:: images/cache-multi-fragment.png
+   :align: center
+
+   Multi-alternate and multi-fragment object storage
+
+.. index:: pinned
+
+Documents which are pinned into the cache must not be overwritten so they are
+evacuated from in front of the write cursor. Each fragment is read and
+rewritten. There is a special lookup mechanism for objects that are being
+evacuated so that they can be found in memory rather than the potentially
+unreliable disk regions. The cache scans ahead of the write cursor to discover
+pinned objects as there is a dead zone immediately before the write cursor from
+which data cannot be evacuated. Evacuated data is read from disk and placed in
+the write queue and written as its turn comes up.
+
+Objects can only be pinned via :file:`cache.config` and while
+:ts:cv:`proxy.config.cache.permit.pinning` is set to non-zero (it is zero by
+default). Objects which are in use when the write cursor is near use the same
+underlying evacuation mechanism but are handled automatically and not via the
+explicit ``pinned`` bit in :cpp:class:`Dir`.
+
+.. [#multiple-alternates] It could, under certain circumstances, be accurate for none of the alternates.
+
+Additional Notes
+====================
+
+Some general observations on the data structures.
+
+Cyclone buffer
+--------------
+
+Because the cache is a *cyclone cache*, objects are not preserved for an
+indefinite time. Even if the object is not :term:`stale` it can be overwritten
+as the cache cycles through its volume. Marking an object as *pinned* preserves
+the object through the passage of the write cursor but this is done by copying
+the object across the gap, in effect re-storing it in the cache. Pinning large
+objects or a large number objects can lead to excessive disk activity. The
+original purpose of pinning was for small, frequently used objects explicitly
+marked by the administrator.
+
+This means the purpose of expiration data on objects is simply to prevent them
+from being served to clients. They are not in the standard sense deleted or
+cleaned up. The space can't be immediately reclaimed in any event, because
+writing only happens at the write cursor. Deleting an object consists only of
+removing the directory entries in the volume directory which suffices to
+(eventually) free the space and render the document inaccessible.
+
+Historically, the cache was designed this way because web content was relatively
+small and not particularly consistent. The design also provided high performance
+and low consistency requirements. There are no fragmentation issues for the
+storage, and both cache misses and object deletions require no disk I/O. It does
+not deal particularly well with long term storage of large objects. See the
+:ref:`volume tagging` appendix for details on some work in this area.
+
+Disk Failure
+------------
+
+The cache is designed to be relatively resistant to disk failures. Because each
+:term:`storage unit` in each :term:`cache volume` is mostly independent, the
+loss of a disk simply means that the corresponding :cpp:class:`Vol` instances
+(one per cache volume that uses the storage unit) becomes unusable. The primary
+issue is updating the volume assignment table to both preserve assignments for
+objects on still operational volumes while distributing the assignments from the
+failed disk to those operational volumes. This mostly done in::
+
+   AIO_Callback_handler::handle_disk_failure
+
+Restoring a disk to active duty is a more difficult task. Changing the volume
+assignment of a :term:`cache key` renders any currently cached data
+inaccessible. This is not a problem when a disk has failed, but is a bit
+trickier to decide which cached objects are to be de facto evicted if a new
+storage unit is added to a running system. The mechanism for this, if any, is
+still under investigation.
+
+Implementation Details
+======================
+
+Stripe Directory
+----------------
+
+.. _directory-entry:
+
+The in memory volume directory entries are described below.
+
+.. cpp:class:: Dir
+
+   Defined in |P-CacheDir.h|_.
+
+   =========== =================== ===================================================
+   Name        Type                Use
+   =========== =================== ===================================================
+   offset      unsigned int:24     Offset of first byte of metadata (volume relative)
+   big         unsigned in:2       Size multiplier
+   size        unsigned int:6      Size
+   tag         unsigned int:12     Partial key (fast collision check)
+   phase       unsigned int:1      Phase of the ``Doc`` (for dir valid check)
+   head        unsigned int:1      Flag: first fragment in an object
+   pinned      unsigned int:1      Flag: document is pinned
+   token       unsigned int:1      Flag: Unknown
+   next        unsigned int:16     Segment local index of next entry.
+   offset_high inku16              High order offset bits
+   =========== =================== ===================================================
+
+The stripe directory is an array of ``Dir`` instances. Each entry refers to
+a span in the volume which contains a cached object. Because every object in
+the cache has at least one directory entry this data has been made as small
+as possible.
+
+The offset value is the starting byte of the object in the volume. It is 40
+bits long, split between the *offset* (lower 24 bits) and *offset_high*
+(upper 16 bits) members. Note that since there is a directory for every
+storage unit in a cache volume, this is the offset in to the slice of a
+storage unit attached to that volume.
+
+.. _dir-size:
+
+The *size* and *big* values are used to calculate the approximate size of
+the fragment which contains the object. This value is used as the number of
+bytes to read from storage at the offset value. The exact size is contained
+in the object metadata in :cpp:class:`Doc` which is consulted once the read
+has completed. For this reason, the approximate size needs to be at least as
+large as the actual size but can be larger, at the cost of reading the
+extraneous bytes.
+
+The computation of the approximate size of the fragment is defined as::
+
+  ( *size* + 1 ) * 2 ^ ( CACHE_BLOCK_SHIFT + 3 * *big* )
+
+Where ``CACHE_BLOCK_SHIFT`` is the bit width of the size of a basic cache
+block (9, corresponding to a sector size of 512). Therefore the value with
+current defines is::
+
+  ( *size* + 1 ) * 2 ^ (9 + 3 * *big*)
+
+.. _big-mult:
+
+Because *big* is 2 bits, the values for the multiplier of *size* are:
+
+   ===== =============== ========================
+   *big* Multiplier      Maximum Size
+   ===== =============== ========================
+   0     512 (2^9)       32768 (2^15)
+   1     4096 (2^12)     262144 (2^18)
+   2     32768 (2^15)    2097152 (2^21)
+   3     262144 (2^18)   16777216 (2^24)
+   ===== =============== ========================
+
+Note also that *size* is effectively offset by one, so a value of 0 indicates
+a single unit of the multiplier.
+
+.. _target-fragment-size:
+
+The target fragment size can set with the :file:`records.config` value
+:ts:cv:`proxy.config.cache.target_fragment_size`.
+
+This value should be chosen so that it is a multiple of a
+:ref:`cache entry multiplier <big-mult>`. It is not necessary to make it a
+power of two[#cache-mult-value]_. Larger fragments increase I/O efficiency but
+lead to more wasted space. The default size (1M, 2^20) is a reasonable choice
+in most circumstances, altough in very specific cases there can be benefit from
+tuning this parameter. |TS| imposes an internal maximum of a 4,194,232 bytes,
+which is 4M (2^22), less the size of a struct :cpp:class:`Doc`. In practice,
+the largest reasonable target fragment size is 4M - 262,144 = 3,932,160.
+
+When a fragment is stored to disk, the size data in the cache index entry is
+set to the finest granularity permitted by the size of the fragment. To
+determine this, consult the :ref:`cache entry multipler <big-mult>` table and
+find the smallest maximum size that is at least as large as the fragment. That
+will indicate the value of *big* selected and therefore the granularity of the
+approximate size. That represents the largest possible amount of wasted disk I/O
+when the fragment is read from disk.
+
+.. index:: DIR_DEPTH, index segment, index buckets
+
+The set of index entries for a volume are grouped in to :term:`segments <directory segment>`.
+The number of segments for an index is selected so that there are as few
+segments as possible such that no segment has more than 2^16 entries.
+Intra-segment references can therefore use a 16 bit value to refer to any other
+entry in the segment.
+
+Index entries in a segment are grouped :term:`buckets <directory bucket>`, each
+of ``DIR_DEPTH`` (currently 4) entries. These are handled in the standard hash
+table manner, giving somewhat less than 2^14 buckets per segment.
+
+.. [#cache-mult-value]
+
+   The comment in earlier versions of the :file:`records.config` documentation
+   which indicated that this value must be a power of two were, unfortunately,
+   mistaken and have been corrected.
+
+.. _cache-directory-probe:
+
+Directory Probing
+-----------------
+
+Directory probing is the locating of a specific :term:`directory entry` in the
+stripe directory based on a :term:`cache ID`. This is handled primarily by the
+function :cpp:func:`dir_probe()`. This is passed the cache ID (:arg:`key`), a
+stripe (:arg:`d`), and a last collision (:arg:`last_collision`). The last of
+these is an in and out parameter, updated as useful during the probe.
+
+Given an ID, the top half (64 bits) is used as a :ref:`segment <dir-segment>`
+index, taken modulo the number of segments in the directory. The bottom half is
+used as a :ref:`bucket <dir-bucket>` index, taken modulo the number of buckets
+per segment. The :arg:`last_collision` value is used to mark the last matching
+entry returned by :cpp:func:`dir_probe`.
+
+After computing the appropriate bucket, the entries in that bucket are searched
+to find a match. In this case a match is detected by comparison of the bottom
+12 bits of the :term:`cache ID` (the *cache tag*). The search starts at the base
+entry for the bucket and then proceeds via the linked list of entries from that
+first entry. If a tag match is found and there is no :arg:`collision` then that
+entry is returned and :arg:`last_collision` is updated to that entry. If
+:arg:`collision` is set and if it isn't the current match, the search continues
+down the linked list, otherwise :arg:`collision` is cleared and the search
+continues.
+
+The effect of this is that matches are skipped until the last returned match
+(:arg:`last_collision`) is found, after which the next match (if any) is
+returned. If the search falls off the end of the linked list, then a miss result
+is returned (if no last collision), otherwise the probe is restarted after
+clearing the collision on the presumption that the entry for the collision has
+been removed from the bucket. This can lead to repeats among the returned
+values but guarantees that no valid entry will be skipped.
+
+Last collision can therefore be used to restart a probe at a later time. This
+is important because the match returned may not be the actual object. Although
+the hashing of the :term:`cache ID` to a :term:`bucket <directory bucket>` and
+the tag matching is unlikely to create false positives, it is possible. When a
+fragment is read the full cache ID is available and checked and if wrong, that
+read can be discarded and the next possible match from the directory found
+because the cache virtual connection tracks the last collision value.
+
+----------------
+Cache Operations
+----------------
+
+Cache activity starts after the HTTP request header has been parsed and
+remapped. Tunneled transactions do not interact with the cache because the
+headers are never parsed.
+
+To understand the logic we must introduce the term *cache valid* which means
+something that is directly related to an object that is valid to be put in the
+cache (e.g. a ``DELETE`` which refers to a URL that is cache valid but cannot
+be cached itself). This is important because |TS| computes cache validity
+several times during a transaction and only performs cache operations for cache
+valid results. The criteria used changes during the course of the transaction
+as well. This is done to avoid the cost of cache activity for objects that
+cannot be in the cache.
+
+The three basic cache operations are: lookup, read, and write. We will take
+deleting entries as a special case of writing where only the volume directory
+is updated.
+
+After the client request header is parsed and is determined to be potentially
+cacheable, a `cache lookup`_ is done. If successful, a `cache read`_ is
+attempted. If either the lookup or the read fails and the content is considered
+cacheable then a `cache write`_ is attempted.
+
+Cacheability
+============
+
+The first thing done with a request with respect to cache is to determine
+whether it is potentially a valid object for the cache. After initial parsing
+and remapping, this check is done primarily to detect a negative result, as it
+allows further cache processing to be skipped. It will not be put in to the
+cache, nor will a cache lookup be performed. There are a number of prerequisites
+along with configuration options to change them. Additional cacheability checks
+are done later in the process, when more is known about the transaction (such
+as plugin operations and the origin server response). Those checks are described
+as appropriate in the sections on the relevant operations.
+
+The set of things which can affect cacheability are:
+
+* Built in constraints.
+* Settings in :file:`records.config`.
+* Settings in :file:`cache.config`.
+* Plugin operations.
+
+The initial internal checks, along with their :file:`records.config`
+overrides[#cacheability-overrides]_, are done in ``HttpTransact::is_request_cache_lookupable``.
+
+The checks that are done are:
+
+   Cacheable Method
+      The request must be one of ``GET``, ``HEAD``, ``POST``, ``DELETE``, ``PUT``.
+
+      See ``HttpTransact::is_method_cache_lookupable()``.
+
+   Dynamic URL
+      |TS| tries to avoid caching dynamic content because it's dynamic. A URL is
+      considered dynamic if:
+
+      *  It is not ``HTTP`` or ``HTTPS``,
+      *  Has query parameters,
+      *  Ends in ``asp``,
+      *  Has ``cgi`` in the path.
+
+      This check can be disabled by setting a non-zero value for
+      :ts:cv:`proxy.config.http.cache.cache_urls_that_look_dynamic`.
+
+      In addition if a TTL is set for rule that matches in :file:`cache.config`
+      then this check is not done.
+
+   Range Request
+      Cache valid only if :ts:cv:`proxy.config.http.cache.range.lookup` in
+      :file:`records.config` is non-zero. This does not mean the range request
+      can be cached, only that it might be satisfiable from the cache. In
+      addition, :ts:cv:`proxy.config.http.cache.range.write`
+      can be set to try to force a write on a range request. This
+      probably has little value at the moment, but if for example the
+      origin server ignores the ``Range:`` header, this option can allow
+      for the response to be cached. It is disabled by default, for
+      best performance.
+
+A plugin can call :c:func:`TSHttpTxnReqCacheableSet()` to force the request to
+be viewed as cache valid.
+
+.. [#cacheability-overrides]
+
+   The code appears to check :file:`cache.config` in this logic by setting the
+   ``does_config_permit_lookup`` in the ``cache_info.directives`` of the state
+   machine instance but I can find no place where the value is used. The
+   directive ``does_config_permit_storing`` is set and later checked so the
+   directive (from the administrator point of view) is effective in preventing
+   caching of the object.
+
+Cache Lookup
+============
+
+If the initial request is not determined to be cache invalid then a lookup is
+done. Cache lookup determines if an object is in the cache and if so, where it
+is located. In some cases the lookup proceeds to read the first ``Doc`` from
+disk to verify the object is still present in the cache.
+
+The basic steps to a cache lookup are:
+
+#. The cache key is computed.
+
+   This is normally computed using the request URL but it can be overridden
+   :ref:`by a plugin <cache-key>` . The cache index string is not stored, as it
+   is presumed computable from the client request headers.
+
+#. The cache stripe is determined (based on the cache key).
+
+   The :term:`cache key` is used as a hash key in to an array of
+   :cpp:class:`Vol` instances. The construction and arrangement of this array
+   is the essence of how volumes are assigned.
+
+#. The cache stripe directory :ref:`is probed <cache-directory-probe>` using the
+   index key computed from the cache key.
+
+   Various other lookaside directories are checked as well, such as the
+   :ref:`aggregation buffer <aggregation-buffer>`.
+
+#. If the directory entry is found the first ``Doc`` is read from disk and
+   checked for validity.
+
+   This is done in :cpp:func:`CacheVC::openReadStartHead()` or
+   :cpp:func:`CacheVC::openReadStartEarliest()` which are tightly coupled
+   methods.
+
+If the lookup succeeds, then a more detailed directory entry (struct
+:cpp:class:`OpenDir`) is created. Note that the directory probe includes a check
+for an already extant ``OpenDir`` which, if found, is returned without
+additional work.
+
+Cache Read
+==========
+
+Cache read starts after a successful `cache lookup`_. At this point the first
+``Doc`` has been loaded in to memory and can be consulted for additional
+information. This will always contain the HTTP headers for all
+:term:`alternates <alternate>` of the object.
+
+.. sidebar:: Read while write
+
+   There is provision in the code to support *read while write*, that is,
+   serving an object from cache in one transaction while it is being written in
+   another. Several settings are needed for it to be used. See
+   :ref:`admin-configuration-reducing-origin-requests`. It must
+   specifically enabled in :file:`records.config` and if not, a cache read will
+   fail if the object is currently be written or updated.
+
+At this point an alternate for the object is selected. This is done by comparing
+the client request to the stored response headers, but it can be controlled by a
+plugin using ``TS_HTTP_ALT_SELECT_HOOK``.
+
+The content can now be checked to see if it is :term:`stale` by calculating the
+*freshness* of the object. This is essentially checking how old the object is
+by looking at the headers and possibly other metadata (note that the headers
+can't be checked until we've selected an alternate).
+
+Most of this work is done in ``HttpTransact::what_is_document_freshness``.
+
+First, the TTL (time to live) value, which can be set in :file:`cache.config`,
+is checked if the request matches the configuration file line. This is done
+based on when the object was placed in the cache, not on any data in the
+headers.
+
+Next, an internal flag (``needs-revalidate-once``) is checked if the
+:file:`cache.config` value ``revalidate-after`` is not set, and if set the
+object is marked *stale*.
+
+After these checks the object age is calculated by ``HttpTransactHeaders::calculate_document_age``.
+and then any configured fuzzing is applied. The limits to this age based on
+available data is calculated by ``HttpTransact::calculate_document_freshness_limit``.
+
+How this age is used is determined by the :file:`records.config` setting for
+:ts:cv:`proxy.config.http.cache.when_to_revalidate`. If this is ``0`` then the
+built caclulations are used which compare the freshness limits with document
+age, modified by any of the client supplied cache control values (``max-age``,
+``min-fresh``, ``max-stale``) unless explicitly overridden in
+:file:`cache.config`.
+
+If the object is not stale then it is served to the client. If it is stale, the
+client request may be changed to an ``If Modified Since`` request to
+:term:`revalidate <revalidation>`.
+
+The request is served using a standard virtual connection tunnel (``HttpTunnel``)
+with the :cpp:class:`CacheVC` acting as the producer and the client ``NetVC``
+acting as the sink. If the request is a range request this can be modified with
+a transform to select the appropriate parts of the object or, if the request
+contains a single range, it can use the range acceleration.
+
+Range acceleration is done by consulting a fragment offset table attached to
+the earliest ``Doc`` which contains offsets for all fragments past the first.
+This allows loading the fragment containing the first requested byte immediately
+rather than performing reads on the intermediate fragments.
+
+Cache Write
+===========
+
+Writing to the cache is handled by an instance of the class :cpp:class:`CacheVC`.
+This is a virtual connection which receives data and writes it to cache, acting
+as a sink. For a standard transaction data transfers between virtual connections
+(*VConns*) are handled by :cpp:class:`HttpTunnel`. Writing to the cache is done
+by attaching a ``CacheVC`` instance as a tunnel consumer. It therefore operates
+in parallel with the virtual connection that transfers data to the client. The
+data does not flow to the cache and then to the client, it is split and goes
+both directions in parallel. This avoids any data synchronization issues between
+the two.
+
+.. sidebar:: Writing to disk
+
+   The actual write to disk is handled in a separate thread dedicated to I/O
+   operations, the AIO threads. The cache logic marshals the data and then hands
+   the operation off to the AIO thread which signals back once the operation
+   completes.
+
+While each ``CacheVC`` handles its transactions independently, they do interact
+at the :term:`volume <cache volume>` level as each ``CacheVC`` makes calls to
+the volume object to write its data to the volume content. The ``CacheVC``
+accumulates data internally until either the transaction is complete or the
+amount of data to write exceeds the target fragment size. In the former
+case the entire object is submitted to the volume to be written. In the latter
+case, a target fragment size amount of data is submitted and the ``CacheVC``
+continues to operate on subsequent data. The volume in turn places these write
+requests in an holding area called the `aggregation buffer`_.
+
+For objects under the target fragment size, there is no consideration of order,
+the object is simply written to the volume content. For larger objects, the
+earliest ``Doc`` is written first and the first ``Doc`` written last. This
+provides some detection ability should the object be overwritten. Because of
+the nature of the write cursor no fragment after the first fragment (in the
+earliest ``Doc``) can be overwritten without also overwriting that first
+fragment (since we know at the time the object was finalized in the cache the
+write cursor was at the position of the first ``Doc``).
+
+.. note::
+
+   It is the responsibility of the ``CacheVC`` to not submit writes that exceed
+   the target fragment size.
+
+.. XXX how does the write logic know if it's an original object write or an update to an existing object?
+
+Update
+------
+
+Cache write also covers the case where an existing object in the cache is
+modified. This occurs when:
+
+* A conditional request is made to the origin server and a ``304 - Not Modified``
+  response is received.
+
+* An alternate of the object is retrieved from an :term:`origin server` and
+  added to the object.
+
+* An alternate of the object is removed (e.g., due to a ``DELETE`` request).
+
+In every case the metadata for the object must be modified. Because |TS| never
+updates data already in the cache this means the first ``Doc`` will be written
+to the cache again and the volume directory entry updated. Because a client
+request has already been processed the first ``Doc`` has been read from cache
+and is in memory. The alternate vector is updated as appropriate (an entry
+added or removed, or changed to contain the new HTTP headers), and then written
+to disk. It is possible for multiple alternates to be updated by different
+``CacheVC`` instances at the same time. The only contention is the first
+``Doc``; the rest of the data for each alternate is completely independent.
+
+.. _aggregation-buffer:
+
+Aggregation Buffer
+------------------
+
+Disk writes to cache are handled through an *aggregation buffer*. There is one
+for each :cpp:class:`Vol` instance. To minimize the number of system calls data
+is written to disk in units of roughly :ref:`target fragment size <target-fragment-size>`
+bytes. The algorithm used is simple: data is piled up in the aggregation buffer
+until no more will fit without going over the target fragment size, at which
+point the buffer is written to disk and the volume directory entries for objects
+with data in the buffer are updated with the actual disk locations for those
+objects (which are determined by the write to disk action). After the buffer is
+written it is cleared and process repeats. There is a special lookup table for
+the aggregation buffer so that object lookup can find cache data in that memory.
+
+Because data in the aggregation buffer is visible to other parts of the cache,
+particularly `cache lookup`_, there is no need to push a partially filled
+aggregation buffer to disk. In effect, any such data is memory cached until
+enough additional cache content arrives to fill the buffer.
+
+The target fragment size has little effect on small objects because the fragment
+size is used only to parcel out disk write operations. For larger objects the
+effect very significant as it causes those objects to be broken up in to
+fragments at different locations on in the volume. Each fragment write has its
+own entry in the volume directory which are computationally chained (each
+:term:`cache key` is computed from the previous one). If possible, a fragment
+table is accumulated in the earliest ``Doc`` which has the offsets of the first
+byte for each fragment.
+
+.. _evacuation-mechanics:
+
+Evacuation Mechanics
+--------------------
+
+By default, the write cursor will overwrite (de facto evict from cache) objects
+as it proceeds once it has gone around the :term:`cache stripe` at least once.
+In some cases this is not acceptable and the object is *evacuated* by reading
+it from the cache and then writing it back to cache which moves the physical
+storage of the object from in front of the write cursor to behind the write
+cursor. Objects that are evacuated are handled in this way based on data in
+stripe data structures (attached to the :cpp:class:`Vol` instance).
+
+Evacuation data structures are defined by dividing up the volume content into
+a disjoint and contiguous set of regions of ``EVACUATION_BUCKET_SIZE`` bytes.
+The :cpp:member:`Vol::evacuate` member is an array with an element for each
+evacuation region. Each element is a doubly linked list of :cpp:class:`EvacuationBlock`
+instances. Each instance contains a :cpp:class:`Dir` that specifies the fragment
+to evacuate. It is assumed that an evacuation block is placed in the evacuation
+bucket (array element) that corresponds to the evacuation region in which the
+fragment is located although no ordering per bucket is enforced in the linked
+list (this sorting is handled during evacuation). Objects are evacuated by
+specifying the first or earliest fragment in the evactuation block. The
+evactuation operation will then continue the evacuation for subsequent fragments
+in the object by adding those fragments in evacuation blocks. Note that the
+actual evacuation of those fragments is delayed until the write cursor reaches
+the fragments, it is not necessarily done at the time the earliest fragment is
+evacuated.
+
+There are two types of evacuations: *reader based* and *forced*. The
+``EvacuationBlock`` has a reader count to track this. If the reader count is
+zero, then it is a forced evacuation and the the target, if it exists, will be
+evacuated when the write cursor gets close. If the reader value is non-zero
+then it is a count of entities that are currently expecting to be able to read
+the object. Readers increment the count when they require read access to the
+object, or create the ``EvacuationBlock`` with a count of 1. When a reader is
+finished with the object it decrements the count and removes the ``EvacuationBlock``
+if the count goes to zero. If the ``EvacuationBlock`` already exists with a
+count of zero, the count is not modified and the number of readers is not
+tracked, so the evacuation is valid as long as the object exists.
+
+Evacuation is driven by cache writes, essentially in :cpp:member:`Vol::aggWrite`.
+This method processes the pending cache virtual connections that are trying to
+write to the stripe. Some of these may be evacuation virtual connections. If so
+then the completion callback for that virtual connection is called as the data
+is put in to the aggregation buffer.
+
+When no more cache virtual connections can be processed (due to an empty queue
+or the aggregation buffer filling) then :cpp:member:`Vol::evac_range` is called
+to clear the range to be overwritten plus an additional :const:`EVACUATION_SIZE`
+range. The buckets covering that range are checked. If there are any items in
+the buckets a new cache virtual connection (a *doc evacuator*) is created and
+used to read the evacuation item closest to the write cursor (i.e. with the
+smallest offset in the stripe) instead of the aggregation write proceeding. When
+the read completes it is checked for validity and if valid, the cache virtual
+connection for it is placed at the front of the write queue for the stripe and
+the write aggregation resumed.
+
+Before doing a write, the method :cpp:func:`Vol::evac_range()` is called to
+start an evacuation. If any fragments are found in the buckets in the range the
+earliest such fragment (smallest offset, closest to the write cursor) is
+selected and read from disk and the aggregation buffer write is suspended. The
+read is done via a cache virtual connection which also effectively serves as the
+read buffer. Once the read is complete, that cache virtual connection instance
+(the *doc evacuator*) is placed at the front of the stripe write queue and
+written out in turn. Because the fragment data is now in memory it is acceptable
+to overwrite the disk image.
+
+Note that when normal stripe writing is resumed, this same check is done again,
+each time evauating (if needed) a fragment and queuing them for writing in turn.
+
+Updates to the directory are done when the write for the evacuated fragment
+completes. Multi-fragment objects are detected after the read completes for a
+fragment. If it is not the first fragment then the next fragment is marked for
+evacuation (which in turn, when it is read, will pull the subsequent fragment).
+The logic presumes that the end of the :term:`alternate` is when the next key
+is not in the directory.
+
+This interacts with the *one at a time* strategy of the aggregation write logic.
+If a fragment is close to the fragment being evacuated, it may end up in the
+same evacuation bucket. Because the aggregation write checks every time for the
+next fragment to evacuate it will find that next fragment and evacuate it before
+it is overwritten.
+
+.. note
+
+   I do not understand the extra key list that is present in an evacuation block. It is labeled as needed for
+   "collisions" but I am unclear on what might be colliding. The bucket entries are stored and matched by stripe offset
+   but if two fragments collide on their offset, only one can be valid. Based on how :ref:`directory probing
+   <cache-directory-probe>` works and the logic of :cpp:func:`evacuate_fragments()` it appears that rather than determine which
+   entry in a directory bucket is the correct one, all of them are marked for evacuation (thereby handling
+   "collisions"). However, each one could have a distinct fragment size and that is set for all of the reads by the
+   first fragment found in the directory. The intent seems to be to read all fragments that collide at the same starting
+   offset and then figure out which one was really on the disk after the read by looking through the key list. However,
+   this seems to presume those fragments will all be the same size, which seems unreasonable. I would think it would
+   also be necessary to update the size in the :cpp:class:`Dir` instance in the evacuation block to the be largest size
+   found among the collisions.
+
+.. _evacuation-operation:
+
+Evacuation Operation
+--------------------
+
+The primary source of fragments to be evacuated are active fragments. That is,
+fragments which are currently open for reading or writing. This is tracked by
+the reader value in the evacuation blocks noted above.
+
+If object pinning is enabled, then a scan is done on a regular basis as the
+write cursor moves to detect pinned objects and mark them for evacuation.
+
+Fragments can also be evacuated through *hit evacuation*. This is configured by
+:ts:cv:`proxy.config.cache.hit_evacuate_percent` and
+:ts:cv:`proxy.config.cache.hit_evacuate_size_limit`. When a fragment is read it
+is checked to see if it is close and in front of the write cursor, close being
+less than the specified percent of the size of the stripe. If set at the default
+value of 10, then if the fragment is withing 10% of the size of the stripe, it
+is marked for evacuation. This is cleared if the write cursor passes through the
+fragment while it remains open (as all open objects are evacuated). If, when the
+object is closed, the fragment is still marked then it is placed in the
+appropriate evacuation bucket.
+
+.. _cache-initialization:
+
+Initialization
+==============
+
+Initialization starts with an instance of :cpp:class:`Store` reading the storage
+configuration file, by default :file:`storage.config`. For each valid element in
+the file an instance of :cpp:class:`Span` is created. These are of basically
+four types:
+
+* File
+
+* Directory
+
+* Disk
+
+* Raw device
+
+After creating all the :cpp:class:`Span` instances, they are grouped by device
+ID to internal linked lists attached to the :cpp:member:`Store::disk`
+array[#store-disk-array]_. Spans that refer to the same directory, disk, or raw
+device are coalesced in to a single span. Spans that refer to the same file
+with overlapping offsets are also coalesced [#coalesced-spans]_. This is all done in
+:c:func:`ink_cache_init()` called during startup.
+
+.. note::
+
+   The span logic is also used by the HostDB and more than one otherwise
+   inexplicable feature is provided by the span logic for that module.
+
+After configuration initialization, the cache processor is started by calling
+:cpp:func:`CacheProcessor::start()`. This does a number of things:
+
+For each valid span, an instance of :cpp:class:`CacheDisk` is created. This
+class is a :term:`continuation` and so can be used to perform potentially
+blocking operations on the span. The primary use of these is to be passed to
+the AIO threads as the callback when an I/O operation completes. These are then
+dispatched to AIO threads to perform :term:`storage unit` initialization. After
+all of those have completed, the resulting storage is distributed across the
+:term:`volumes <cache volume>` in :c:func:`cplist_reconfigure`. The
+:cpp:class:`CacheVol` instances are created at this time.
+
+:term:`Cache stripe <cache stripe>` assignment setup is done once all stripes
+have initialized (that is, the stripe header information has been successfully
+read from disk for all stripes). The assignment information is stored as an
+array of indices. These are indices in to an array of stripes. Both the
+assignment and the stripe arrays are stored in an instance of :cpp:class:`CacheHostRecord`.
+Assignment initialization consists of populating the assignment array, which is
+much larger than the stripe array.
+
+There is an instance of :cpp:class:`CacheHostRecord` for each line in
+:file:`hosting.config` and one generic record. For the configured instances, the
+set of stripes is determined from the cache volume specified in the line. If no
+lines are specified, all stripes are placed in the generic record, otherwise
+only those stripes marked as default are placed in the generic record.
+
+.. note::
+
+   If hosting records are specified, it is an error to not specify at least one
+   default cache volume.
+
+The assignment table is initialized in :c:func:`build_vol_hash_table` which is
+called for each :cpp:class:`CacheHostRecord` instance. For each stripe in the
+host record, a sequence of pseudo-random numbers is generated. This begins with
+the folded hash of the stripe hash identifier, which is the device path followed
+by the ``skip`` and ``size`` values for that stripe, making it unique. This
+also makes the sequence deterministic for any particular stripe.
+
+Each stripe gets one number in its sequence for every `VOL_HASH_ALLOC_SIZE` (8
+MB currently) of storage. These numbers are paired with the stripe index,
+combined across all stripes, then sorted by the random values. The resulting
+array is sampled for every slot in the stripe assignment table by dividing the
+maximum random value by the size of the assignment table and using the value
+midway between each multiple of the result of the division. The coalesced
+psuedo-random sequence is scanned for each sample in turn and the first number
+not greater than the sample is found. The stripe associated with that value is
+used for that assignment table entry.
+
+While this procedure is deterministic, it is sensitive to initial conditions,
+including the size of each stripe.
+
+.. rubric:: Footnotes
+
+.. [#store-disk-array]
+
+   `Work is under way <https://issues.apache.org/jira/browse/TS-2020>`_ on
+   extending this to include objects that are in the memory cache.
+
+.. [#coalesced-spans]
+
+   This linked list is mostly ignored in later processing, causing all but one
+   file or directory storage units on the same device to be ignored. See
+   `TS-1869 <https://issues.apache.org/jira/browse/TS-1869>`_.
+

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/consistency.en.rst
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/consistency.en.rst b/doc/developer-guide/architecture/consistency.en.rst
new file mode 100644
index 0000000..c5d5faf
--- /dev/null
+++ b/doc/developer-guide/architecture/consistency.en.rst
@@ -0,0 +1,174 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements.  See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership.  The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied.  See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+.. include:: ../../common.defs
+
+.. _developer-cache-consistency:
+
+Cache Tools
+~~~~~~~~~~~
+
+Tools and techniques for cache monitoring and inspection.
+
+* :ref:`The cache inspector <inspecting-the-cache>`.
+
+Topics to be done
+~~~~~~~~~~~~~~~~~
+
+* Resident alternates
+* Object refresh
+
+Cache Consistency
+~~~~~~~~~~~~~~~~~
+
+The cache is completely consistent, up to and including kicking the power cord
+out, if the write buffer on consumer disk drives is disabled. You need to use::
+
+  hdparm -W0
+
+The cache validates that all the data for the document is available and will
+silently mark a partial document as a miss on read. There is no gentle
+shutdown for Traffic Server. You simply kill the process and the recovery code
+(fsck) is run every time Traffic Server starts up.
+
+On startup the two versions of the index are checked, and the last valid one is
+read into memory. |TS| then moves forward from the last snapped write
+cursor and reads all the fragments written to disk and updates the directory
+(as in a log-based file system). It stops reading at the write before the last
+valid write header it sees (as a write is not necessarily atomic because of
+sector reordering). Then the new updated index is written to the invalid
+version (in case of a crash during startup) and the system starts.
+
+.. _volume tagging:
+
+Volume Tagging
+~~~~~~~~~~~~~~
+
+Currently, :term:`cache volumes <cache volume>` are allocated somewhat
+arbitrarily from storage elements. `This enhancement <https://issues.apache.org/jira/browse/TS-1728>`__
+allows :file:`storage.config` to assign :term:`storage units <storage unit>` to
+specific :term:`volumes <cache volume>` although the volumes must still be
+listed in :file:`volume.config` in general and in particular to map domains to
+specific volumes. A primary use case for this is to be able to map specific
+types of content to different storage elements. This can be employed to have
+different storage devices for various types of content (SSD vs. rotational).
+
+Version Upgrade
+---------------
+
+It is currently the case that any change to the cache format will clear the
+cache. This is an issue when upgrading the |TS| version and should be kept in mind.
+
+.. _cache-key:
+
+Controlling the cache key
+-------------------------
+
+The :term:`cache key` is by default the URL of the request. There are two
+possible choices, the original (pristine) URL and the remapped URL. Which of
+these is used is determined by the configuration value
+:ts:cv:`proxy.config.url_remap.pristine_host_hdr`.
+
+This is an ``INT`` value. If set to ``0`` (disabled) then the remapped URL is
+used, and if it is not ``0`` (enabled) then the original URL is used. This
+setting also controls the value of the ``HOST`` header that is placed in the
+request sent to the :term:`origin server`, using the hostname from the original
+URL if not ``0`` and the host name from the remapped URL if ``0``. It has no
+other effects.
+
+For caching, this setting is irrelevant if no remapping is done or there is a
+one-to-one mapping between the original and remapped URLs.
+
+It becomes significant if multiple original URLs are mapped to the same
+remapped URL. If pristine headers are enabled, requests to different original
+URLs will be stored as distinct :term:`objects <cache object>` in the cache. If
+disabled, the remapped URL will be used and there may be collisions. This is
+bad if the contents different, but quite useful if they are the same (as in
+situations where the original URLs are just aliases for the same underlying
+server resource).
+
+This is also an issue if a remapping is changed because it is effectively a
+time axis version of the previous case. If an original URL is remapped to a
+different server address then the setting determines if existing cached objects
+will be served for new requests (enabled) or not (disabled). Similarly, if the
+original URL mapped to a particular URL is changed then cached objects from the
+initial original URL will be served from the updated original URL if pristine
+headers is disabled.
+
+These collisions are not by themselves good or bad. An administrator needs to
+decide which is appropriate for their situation and set the value correspondingly.
+
+If a greater degree of control is desired, a plugin must be used to invoke the
+API calls :c:func:`TSHttpTxnCacheLookupUrlSet()` or  :c:func:`TSCacheUrlSet()`
+to provide a specific :term:`cache key`. The :c:func:`TSCacheUrlSet()` API can
+be called as early as ``TS_HTTP_READ_REQUEST_HDR_HOOK`` but no later than
+``TS_HTTP_POST_REMAP_HOOK``. It can be called only once per transaction;
+calling it multiple times has no additional effect.
+
+A plugin that changes the cache key must do so consistently for both cache hit
+and cache miss requests because two different requests that map to the same
+cache key will be considered equivalent by the cache. Use of the URL directly
+provides this and so must any substitute. This is entirely the responsibility
+of the plugin; there is no way for the |TS| core to detect such an occurrence.
+
+If :c:func:`TSHttpTxnCacheLookupUrlGet()` is called after new cache url set by
+:c:func:`TSHttpTxnCacheLookupUrlSet()` or :c:func:`TSCacheUrlSet()`, it should
+use a URL location created by :c:func:`TSUrlCreate()` as its third input
+parameter instead of getting ``url_loc`` from the client request.
+
+It is a requirement that the string be syntactically a URL but otherwise it is
+completely arbitrary and need not have any path. For instance, if the company
+Network Geographics wanted to store certain content under its own
+:term:`cache key`, using a document GUID as part of the key, it could use a
+cache key like ::
+
+   ngeo://W39WaGTPnvg
+
+The scheme ``ngeo`` was picked specifically because it is not a valid URL
+scheme, and so will never collide with any valid URL.
+
+This can be useful if the URL encodes both important and unimportant data.
+Instead of storing potentially identical content under different URLs (because
+they differ on the unimportant parts) a url containing only the important parts
+could be created and used.
+
+For example, suppose the URL for Network Geographics content encoded both the
+document GUID and a referral key. ::
+
+   http://network-geographics-farm-1.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H
+
+We don't want to serve the same content for every possible referrer. Instead,
+we could use a plugin to convert this to the previous example and requests that
+differed only in the referrer key would all reference the same cache entry.
+Note that we would also map the following to the same cache key ::
+
+   http://network-geographics-farm-56.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H
+
+This can be handy for sharing content between servers when that content is
+identical. Plugins can change the cache key, or not, depending on any data in
+the request header. For instance, not changing the cache key if the request is
+not in the ``doc`` directory. If distinguishing servers is important, that can
+easily be pulled from the request URL and used in the synthetic cache key. The
+implementor is free to extract all relevant elements for use in the cache key.
+
+While there is no explicit requirement that the synthetic cache key be based on
+the HTTP request header, in practice it is generally necessary due to the
+consistency requirement. Because cache lookup happens before attempting to
+connect to the :term:`origin server`, no data from the HTTP response header is
+available, leaving only the request header. The most common case is the one
+described above where the goal is to elide elements of the URL that do not
+affect the content to minimize cache footprint and improve cache hit rates.

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/data-structures.en.rst
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/data-structures.en.rst b/doc/developer-guide/architecture/data-structures.en.rst
new file mode 100644
index 0000000..e63e1c6
--- /dev/null
+++ b/doc/developer-guide/architecture/data-structures.en.rst
@@ -0,0 +1,207 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements.  See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership.  The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied.  See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+.. include:: ../../common.defs
+
+.. _developer-cache-data-structures:
+
+Data Structures
+***************
+
+.. cpp:class:: OpenDir
+
+   An open directory entry. It contains all the information of a
+   :cpp:class:`Dir` plus additional information from the first :cpp:class:`Doc`.
+
+.. cpp:class:: CacheVC
+
+   A virtual connection class which accepts input for writing to cache.
+
+.. cpp:function:: int CacheVC::openReadStartHead(int event, Event* e)
+
+   Performs the initial read for a cached object.
+
+.. cpp:function:: int CacheVC::openReadStartEarliest(int event, Event* e)
+
+   Performs the initial read for an :term:`alternate` of an object.
+
+.. cpp:class:: HttpTunnel
+
+   Data transfer driver. This contains a set of *producers*. Each producer is
+   connected to one or more *consumers*. The tunnel handles events and buffers
+   so that data moves from producers to consumers. The data, as much as
+   possible, is kept in reference counted buffers so that copies are done only
+   when the data is modified or for sources (which acquire data from outside
+   |TS|) and sinks (which move data to outside |TS|).
+
+.. cpp:class:: CacheControlResult
+
+   Holds the data from a line in :file:`cache.config`.
+
+.. cpp:class:: CacheHTTPInfoVector
+
+   Defined in |P-CacheHttp.h|_. This is an array of :cpp:class:`HTTPInfo`
+   objects and serves as the respository of information about alternates of an
+   object. It is marshaled as part of the metadata for an object in the cache.
+
+.. cpp:class:: HTTPInfo
+
+   Defined in |HTTP.h|_.
+
+   This class is a wrapper for :cpp:class:`HTTPCacheAlt`. It provides the
+   external API for accessing data in the wrapped class. It contains only a
+   pointer (possibly ``NULL``) to an instance of the wrapped class.
+
+.. cpp:class:: CacheHTTPInfo
+
+   A typedef for :cpp:class:`HTTPInfo`.
+
+.. cpp:class:: HTTPCacheAlt
+
+   Defined in |HTTP.h|_.
+
+   This is the metadata for a single :term:`alternate` for a cached object. It
+   contains, among other data, the following:
+
+   * The key for the earliest ``Doc`` of the alternate.
+
+   * The request and response headers.
+
+   * The fragment offset table.[#fragment-offset-table]_
+
+   * Timestamps for request and response from :term:`origin server`.
+
+.. cpp:class:: EvacuationBlock
+
+    Record for evacuation.
+
+.. cpp:class:: Vol
+
+   This represents a :term:`storage unit` inside a :term:`cache volume`.
+
+   .. cpp:member:: off_t Vol::segments
+
+      The number of segments in the volume. This will be roughly the total
+      number of entries divided by the number of entries in a segment. It will
+      be rounded up to cover all entries.
+
+   .. cpp:member:: off_t Vol::buckets
+
+      The number of buckets in the volume. This will be roughly the number of
+      entries in a segment divided by ``DIR_DEPTH``. For currently defined
+      values this is around 16,384 (2^16 / 4). Buckets are used as the targets
+      of the index hash.
+
+   .. cpp:member:: DLL\<EvacuationBlock\> Vol::evacuate
+
+      Array of of :cpp:class:`EvacuationBlock` buckets. This is sized so there
+      is one bucket for every evacuation span.
+
+   .. cpp:member:: off_t len
+
+      Length of stripe in bytes.
+
+.. cpp:function:: int Vol::evac_range(off_t low, off_t high, int evac_phase)
+
+   Start an evacuation if there is any :cpp:class:`EvacuationBlock` in the range
+   from :arg:`low` to :arg:`high`. Return ``0`` if no evacuation was started,
+   non-zero otherwise.
+
+.. cpp:class:: CacheVol
+
+   A :term:`cache volume` as described in :file:`volume.config`.
+
+.. cpp:class:: Doc
+
+   Defined in |P-CacheVol.h|_.
+
+   .. cpp:member:: uint32_t Doc::magic
+
+      Validity check value. Set to ``DOC_MAGIC`` for a valid document.
+
+   .. cpp:member:: uint32_t Doc::len
+
+      The length of this segment including the header length, fragment table,
+      and this structure.
+
+   .. cpp:member:: uint64_t Doc::total_len
+
+      Total length of the entire document not including meta data but including
+      headers.
+
+   .. cpp:member:: INK_MD5 Doc::first_key
+
+      First index key in the document (the index key used to locate this object
+      in the volume index).
+
+   .. cpp:member:: INK_MD5 Doc::key
+
+      The index key for this fragment. Fragment keys are computationally
+      chained so that the key for the next and previous fragments can be
+      computed from this key.
+
+   .. cpp:member:: uint32_t Doc::hlen
+
+      Document header (metadata) length. This is not the length of the HTTP
+      headers.
+
+   .. cpp:member:: uint8_t Doc::ftype
+
+      Fragment type. Currently only ``CACHE_FRAG_TYPE_HTTP`` is used. Other
+      types may be used for cache extensions if those are ever implemented.
+
+   .. cpp:member:: uint24_t Doc::flen
+
+      Fragment table length, if any. Only the first ``Doc`` in an object should
+      contain a fragment table.
+
+      The fragment table is a list of offsets relative to the HTTP content (not
+      counting metadata or HTTP headers). Each offset is the byte offset of the
+      first byte in the fragment. The first element in the table is the second
+      fragment (what would be index 1 for an array). The offset for the first
+      fragment is of course always zero and so not stored. The purpose of this
+      is to enable a fast seek for range requests. Given the first ``Doc`` the
+      fragment containing the first byte in the range can be computed and loaded
+      directly without further disk access.
+
+      Removed as of version 3.3.0.
+
+   .. cpp:member:: uint32_t Doc::sync_serial
+
+      Unknown.
+
+   .. cpp:member:: uint32_t Doc::write_serial
+
+      Unknown.
+
+   .. cpp:member:: uint32_t pinned
+
+      Flag and timer for pinned objects.
+
+   .. cpp:member:: uint32_t checksum
+
+      Unknown.
+
+.. cpp:class:: VolHeaderFooter
+
+.. rubric:: Footnotes
+
+.. [#fragment-offset-table]
+
+   Changed in version 3.2.0. This previously resided in the first ``Doc`` but
+   that caused different alternates to share the same fragment table.
+

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/ats-cache-volume-definition.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/ats-cache-volume-definition.png b/doc/developer-guide/architecture/images/ats-cache-volume-definition.png
new file mode 100644
index 0000000..238d291
Binary files /dev/null and b/doc/developer-guide/architecture/images/ats-cache-volume-definition.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/ats-cache-volume-directory.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/ats-cache-volume-directory.png b/doc/developer-guide/architecture/images/ats-cache-volume-directory.png
new file mode 100644
index 0000000..631ff9e
Binary files /dev/null and b/doc/developer-guide/architecture/images/ats-cache-volume-directory.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/ats-cache-volume-layout.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/ats-cache-volume-layout.png b/doc/developer-guide/architecture/images/ats-cache-volume-layout.png
new file mode 100644
index 0000000..b05bf88
Binary files /dev/null and b/doc/developer-guide/architecture/images/ats-cache-volume-layout.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/ats-cache-write-cursor.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/ats-cache-write-cursor.png b/doc/developer-guide/architecture/images/ats-cache-write-cursor.png
new file mode 100644
index 0000000..cb31d73
Binary files /dev/null and b/doc/developer-guide/architecture/images/ats-cache-write-cursor.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-directory-structure.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-directory-structure.png b/doc/developer-guide/architecture/images/cache-directory-structure.png
new file mode 100644
index 0000000..8719d1e
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-directory-structure.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-doc-layout-3-2-0.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-doc-layout-3-2-0.png b/doc/developer-guide/architecture/images/cache-doc-layout-3-2-0.png
new file mode 100644
index 0000000..d758342
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-doc-layout-3-2-0.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-doc-layout-4-0-1.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-doc-layout-4-0-1.png b/doc/developer-guide/architecture/images/cache-doc-layout-4-0-1.png
new file mode 100644
index 0000000..d97b154
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-doc-layout-4-0-1.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-multi-fragment.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-multi-fragment.png b/doc/developer-guide/architecture/images/cache-multi-fragment.png
new file mode 100644
index 0000000..164d7d6
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-multi-fragment.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-span-layout.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-span-layout.png b/doc/developer-guide/architecture/images/cache-span-layout.png
new file mode 100644
index 0000000..317d69a
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-span-layout.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-spans.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-spans.png b/doc/developer-guide/architecture/images/cache-spans.png
new file mode 100644
index 0000000..6a5e5f8
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-spans.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/cache-stripe-layout.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/cache-stripe-layout.png b/doc/developer-guide/architecture/images/cache-stripe-layout.png
new file mode 100644
index 0000000..73b4140
Binary files /dev/null and b/doc/developer-guide/architecture/images/cache-stripe-layout.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/dir-bucket-assign.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/dir-bucket-assign.png b/doc/developer-guide/architecture/images/dir-bucket-assign.png
new file mode 100644
index 0000000..f790e8b
Binary files /dev/null and b/doc/developer-guide/architecture/images/dir-bucket-assign.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/images/dir-segment-bucket.png
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/images/dir-segment-bucket.png b/doc/developer-guide/architecture/images/dir-segment-bucket.png
new file mode 100644
index 0000000..1b50148
Binary files /dev/null and b/doc/developer-guide/architecture/images/dir-segment-bucket.png differ

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/ce162a6d/doc/developer-guide/architecture/index.en.rst
----------------------------------------------------------------------
diff --git a/doc/developer-guide/architecture/index.en.rst b/doc/developer-guide/architecture/index.en.rst
new file mode 100644
index 0000000..6752656
--- /dev/null
+++ b/doc/developer-guide/architecture/index.en.rst
@@ -0,0 +1,43 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+   or more contributor license agreements.  See the NOTICE file
+   distributed with this work for additional information
+   regarding copyright ownership.  The ASF licenses this file
+   to you under the Apache License, Version 2.0 (the
+   "License"); you may not use this file except in compliance
+   with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing,
+   software distributed under the License is distributed on an
+   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+   KIND, either express or implied.  See the License for the
+   specific language governing permissions and limitations
+   under the License.
+
+.. include:: ../../common.defs
+
+.. _developer-architecture:
+
+Cache Architecture
+******************
+
+The original architectural documents for Traffic Server were lost in the
+transition to an open source project. The documents in this section are
+provisional and were written based on the existing code. The purpose is to have
+a high level description of aspects of Traffic Server to better inform ongoing
+work.
+
+In the final section on "hacking" we try to document our approaches to
+understanding and modifying the source.
+
+.. toctree::
+   :maxdepth: 2
+
+   architecture.en
+   data-structures.en
+   api-functions.en
+   consistency.en
+   ram-cache.en
+   tiered-storage.en
+