You are viewing a plain text version of this content. The canonical link for it is here.
Posted to xindice-dev@xml.apache.org by jb...@apache.org on 2002/11/25 17:15:25 UTC

cvs commit: xml-xindice/src/documentation/resources/images pagedfile.png

jbates      2002/11/25 08:15:25

  Modified:    src/documentation/content/xdocs/dev book.xml
  Added:       src/documentation/content/xdocs/dev guide-internals.xml
               src/documentation/resources/images pagedfile.png
  Log:
  Started "Xindice internals" documentation
  
  Revision  Changes    Path
  1.4       +5 -2      xml-xindice/src/documentation/content/xdocs/dev/book.xml
  
  Index: book.xml
  ===================================================================
  RCS file: /home/cvs/xml-xindice/src/documentation/content/xdocs/dev/book.xml,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -r1.3 -r1.4
  --- book.xml	24 Nov 2002 06:48:44 -0000	1.3
  +++ book.xml	25 Nov 2002 16:15:25 -0000	1.4
  @@ -18,11 +18,14 @@
      </menu>
   
      <menu label="Documentation">
  -      <menu-item label="How to contribute" href="doc-contributing.html"/>
         <menu-item label="Administrator Guide" href="guide-administrator.html"/>
         <menu-item label="User Guide" href="guide-user.html"/>
         <menu-item label="Developer Guide" href="guide-developer.html"/>
         <menu-item label="Commandline Tool Guide" href="guide-commandline.html"/>
  +   </menu>
  +   <menu label="Developer documentation">
  +      <menu-item label="How to contribute" href="doc-contributing.html"/>
  +      <menu-item label="Xindice internals" href="guide-internals.html"/>
      </menu>
   
   </book>
  
  
  
  1.1                  xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml
  
  Index: guide-internals.xml
  ===================================================================
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "document-v11.dtd">
  <!--
    - $Id: guide-internals.xml,v 1.1 2002/11/25 16:15:25 jbates Exp $ $Date: 2002/11/25 16:15:25 $
    -->
  <document>
      <header>
          <title>Xindice internals</title>
          <authors>
              <person id="jb" name="James Bates" email="james@amplexor.com"/>
          </authors>
          <notice/>
          <abstract>
              This document describes the internal organization and operation
              of the Xindice native XML database engine. It is important reading
              for those who intend to contribute to Xindice's core engine.
          </abstract>
      </header>
      <body>
          <warning>This documentation is a work in progress and is only applicable to the CVS version of Xindice.
                   Its content is based mainly on close inspection of existing source code and some
                   reverse engeneering. Some of the information may be inaccurate, or at least misleading
                   with respect to the intent of the original core developers (which isn't always obvious
                   from the source code). You have been warned.
          </warning>
          <section>
              <title>1. Overall Xindice architecture</title>
              <p>Xindice is a native XML database engine that is written
                 entirely in <em>Java</em>. As such it must always be hosted
                 by a <em>Java Virtual Machine</em> (JVM).</p>
              <p>When running, a Xindice instance stores a number of data items in
                 Java objects inside the JVM, the most important of which are:</p>
              <ul>
                  <li>Object representation of Collection hierarchy</li>
                  <li>Client connection sate information</li>
                  <li>Various cached data items</li>
              </ul>
              <p>In addition, Xindice needs access to disk files containing the XML
                 data, and related meta-data. The files are stored inside a diskfile
                 directory hierarchy, that starts somewhere called the <em>database root</em>.
              </p>
              <section>
                  <title>1.1. Access modes</title>
                  <p>Xindice can be set up to run in a JVM in two different ways,
                     depending on how clients will want to use Xindice.</p>
                  <p>In <em>embedded</em> mode, a complete Java application will
                     set up a Xindice instance in <em>its own</em> JVM. Only that one
                     Java application is able to access and manipulate the data in
                     Xindice. Clients using the XML:DB API will use something called the
                     <em>embedded driver</em> to access the Xindice instance that is running
                     inside the same JVM as the host application.</p>
                  <p>In <em>server</em> mode, Xindice is run as a standard J2EE <em>web application</em>,
                     in some <em>web application container</em>, such as <link
                     href="http://jakarta.apache.org/tomcat">Apache Tomcat</link>. In this
                     mode, the JVM hosting Xindice, is in fact the JVM running the web application
                     container. Clients connect to Xindice from different JVM's possibly located
                     on different machines, using <em>XML-RPC</em>, a <em>Remote Procedure Call</em>
                     standard designed to work on top of <em>HTTP</em> (which is why Xindice is
                     packaged as a <em>web application</em> in this mode).</p>
              </section>
          </section>
          <section>
              <title>2. Organization of Collections</title>
              <p>Logically, all XML data stored in Xindice is organized into a hierarchy
                 of <em>collections</em>. A collection is exactly what its name suggests:
                 it contains any number of XML documents, and can in addition contain its
                 own child collections, thus providing a hierarchy.</p>
              <p>The &quot;root&quot; collection is also called the <em>Database</em>. It is
                 special in that:</p>
              <ol>
                  <li>It has no parent.</li>
                  <li>It can contain no XML documents of its own. It only has child collections.</li>
              </ol>
              <p>Each collection in the database is represented in Java by an object of class
                 <code>org.apache.xindice.core.Collection</code>. As with many Java objects inside
                 Xindice, it is initialized using an <em>XML configuration description</em>,
                 a piece of XML describing the properties of the collection. This XML configuration
                 is modelled in Java as an object of class
                 <code>org.apache.xindice.util.Configuration</code>. To set up the configuration of
                 a collection, Xindice calls the collection's <code>setConfig()</code> method, passing it
                 an appropriately obtained <code>org.apache.xindice.util.Configuration</code> object.</p>
              <section>
                  <title>2.1. The Database</title>
                  <p>The database, or &quot;root&quot; collection is the Java object
                     that provides the link to everything else used by the Xindice instance.
                     When Xindice first starts, its first act is to create and initialize an
                     object of class <code>org.apache.xindice.core.Database</code>, which extends
                     <code>org.apache.xindice.core.Collection</code>.</p>
                  <p>The database object is initialized using an XML configuration file that is obtained
                     from outside the database (i.e. it is stored simply as a file somewhere). In
                     the case of an embedded Xindice instance, the file is referenced using the
                     Java property <code>xindice.configuration</code>, whereas in server mode, the file is
                     referenced by the parameter <code>xindice-configuration</code> in the Xindice web
                     application's <code>web.xml</code> file.</p>
                  <p>The format of the XML configuration file used to initialize the database object
                     is as follows:</p>
                  <source><![CDATA[
  <xindice>
     <root-collection dbroot="./db/" name="db">
        <queryengine>
           <resolver autoindex="false" class="org.apache.xindice.core.query.XPathQueryResolver" />
           <resolver class="org.apache.xindice.core.xupdate.XUpdateQueryResolver" />
        </queryengine>
     </root-collection>
  </xindice>
  ]]></source>
                  <p>In fact, if during initialization, the XML configuration file cannot be
                     found for some reason, rather that throw an error, Xindice will <em>assume</em>
                     a default configuration file, which is exactly the one shown above.</p>
                  <p>The important elements in this configuration file are:</p>
                  <ul>
                      <li>the <code>dbroot</code> attribute of the <code>root-collection</code>
                          element. It is a directory path (absolute or relative; if relative,
                          it is assumed to be relative to the <code>XINDICE_HOME</code>
                          environment variable), pointing to the <em>database root</em> directory of the
                          database instance. Recall that this is where Xindice stores all its data and
                          meta-data files.</li>
                      <li>the <code>name</code> attribute of the <code>root-collection</code> element.
                          The root collection (or database) in Xindice is not called simply &quot;/&quot; as
                          you may have expected. It actually has a name; this name is specified here.</li>
                      <li>the various <code>resolver</code> elements in the <code>query-engine</code> element. These
                          point to the <em>query engines</em>, and there Java implementations that this
                          Xindice instance should support. Out of the box, Xindice supports two query styles,
                          <em>XPath</em> and <em>XUpdate</em>. They are detailed in a subsequent chapter.</li>
                  </ul>
                  <p>Notice that this external configuration file does <em>not</em> specify
                     the <em>child collections</em> of the database. As we will see shortly,
                     these are configured in XML somewhere <em>inside</em> the database,
                     which we can now access (since we know where the database data is
                     stored).</p>
              </section>
              <section>
                  <title>2.2. The database root directory</title>
                  <p>As indicated, this directory contains all data and meta-data for the
                     XML content of the database.</p>
                  <p>The directory structure inside this database root directory reflects
                     the child collection structure of the database. So if there is a child
                     collection named <code>mycol</code> in the database, the database root
                     directory will contain a subdirectory named <code>mycol</code> and so on.</p>
                  <p>Each collection's directory contains at the minimum a file with extension
                     <code>.tbl</code> that contains <em>all</em> the XML documents stored in
                     that collection. The file is <em>not</em> human-readable. Its format
                     is explained in subsequent chapters.</p>
              </section>
              <section>
                  <title>2.3. The System Collection</title>
                  <p>One special collection, called <code>system</code>, always exists within
                     a Xindice database. When the Xindice database is initialized,
                     it automatically also loads the system collection, as this known to always exist.
                     The structure of the system collection is simple: it contains no documents of its
                     own, but contains two child collections: <code>SysConfig</code> and <code>SysSymbols</code>.</p>
                  <p>The <code>SysConfig</code> collection contains exactly one document called <code>database.xml</code>.
                     <code>SysSymbols</code> contains various documents that are in fact the <em>Symbol tables</em>
                     used for storage of the element and attribute names of all XML content in the database. The
                     symbol tables are detailed in a subsequent section.</p>
                  <p>The <code>database.xml</code> is the XML configuration file that is used to
                     initialize all other collections in the database. It is located in the database itself,
                     because it obviously needs to be updated each time collections are added or removed
                     from the database.</p>
                  <p>Its structure is as shown below: (you can check your own configuration by issueing
                     the command-line tool invocation: <code>xindice rd -c /db/system/SysConfig -n database.xml</code>)
                     </p>
                  <source><![CDATA[
  <database name="db">
      <collections>
          <collection compressed="true" name="james">
              <filer class="org.apache.xindice.core.filer.BTreeFiler" />
              <indexes>
                  <index class="org.apache.xindice.core.indexer.ValueIndexer" name="myidx" pattern="sub" />
              </indexes>
              <collections>
                  <collection compressed="true" name="sub">
                      <filer class="org.apache.xindice.core.filer.BTreeFiler" />
                      <indexes />
                  </collection>
              </collections>
          </collection>
          <collection compressed="true" name="james_sub">
              <filer class="org.apache.xindice.core.filer.BTreeFiler" />
              <indexes />
          </collection>
      </collections>
  </database>
                  ]]></source>
                  <p>As you can plainly see, it is here that the XML configuration for
                     all remaining collections is stored. Please note that the
                     <code>system</code> collection is not mentioned here. The XML configuration
                     for the <code>system</code> is hard-coded into Xindice as follows:</p>
                  <source><![CDATA[
  <collection name="system">
      <!-- No filer for system collection: it contains no doucments itself -->
      <collections>
          <collection name="SysSymbols" compressed="true">
              <filer class="org.apache.xindice.core.filer.BTreeFiler" />
              <symbols>
                  <symbol name="symbols" id="0" />
                  <symbol name="symbol" id="1" />
                  <symbol name="name" id="2" />
                  <symbol name="id" id="3" />
                  <symbol name="nsuri" id="4" />
              </symbols>
          </collection>
          <collection name="SysConfig" compressed="false">
              <filer class="org.apache.xindice.core.filer.BTreeFiler" />
          </collection>"
      </collections>"
  </collection>
  ]]></source>
                  <p>The only way to modify this configuration is to change
                     the Xindice source code and recompile.</p>
              </section>
              <section>
                  <title>2.4. Other Collections</title>
                  <p>The XML Configuration data used to initialize collection
                     objects in Java (of
                     class <code>org.apache.xindice.core.Collection</code>) is, as shown in
                     the examples above, located in an XML document in the system
                     collection, or, in the case of the system collection itself,
                     hard-coded into Xindice.</p>
                  <p>The important aspects of this configuration data are:</p>
                  <ul>
                      <li>A collection is represented by a <code>collection</code> configuration
                          element. The <code>name</code> attribute indicates the collection's name.
                          The <code>compressed</code> attribute, usually <code>true</code> indicates
                          how XML data should be encoded by the collection's <em>filer</em>. More
                          on this in a subsequent chapter.</li>
                      <li>Each <code>collection</code> element contains a <code>filer</code> element.
                          This element basically tells the collection the name of a Java class
                          that it can use to read its own <em>data file</em>. (The file with a
                          <code>.tbl</code> extension mentioned earlier).
                          <code>org.apache.xindice.core.filer.BTreeFiler</code> is the most common,
                          and indeed the standard filer class used in Xindice. If a collection has
                          no filer (because the <code>filer</code> element is missing, or because the Java
                          class it points to couldn't be loaded), then it won't be able to store
                          any XML documents. That's the case for example of the database (root collection),
                          and the system collection.</li>
                      <li>A <code>collection</code> element <em>may</em> contain a
                          <code>collections</code> element, which contains one <code>collection</code>
                          element per child collection of the collection under discussion.
                      </li>
                  </ul>
              </section>
          </section>
          <section>
              <title>3. Data storage</title>
              <p>The XML data contained in the XML documents of a collection is
                 stored in one single data file with extension <code>.tbl</code> that
                 is located in the collection's directory somewhere in the database
                 root directory. A special Java class called a <em>filer</em> is responsable
                 for reading and writing XML data to such a data file.</p>
              <p>In this chapter and the next, we will examine the most common filer in Xindice,
                 implemented by the <code>org.apache.xindice.core.filer.BTreeFiler</code> class.
                 The mechanism is somewhat complex, but it is broken down into a number
                 of superimposed <em>layers</em>, each layer implementing some abstract
                 data structure designied to improve overall performance.</p>
              <p>In this chapter, we will concern ourselves not with XML storage directly,
                 but the more abstract process of storage of (key,value) pairs. In essence,
                 a Xindice file is no more that a performance-oriented format for storing
                 (key,value) pairs. We will see later on, that this data format is used not
                 only for the storage of XML data itself, but also for the storage of
                 <em>indexes.</em></p>
              <p>The <code>org.apache.xindice.core.filer.BTreeFiler</code> class breaks
                 up data storage into two layers. The bottom-most layer provides a
                 <em>paged file</em> implementation. The code for this layer is located
                 in <code>org.apache.core.filer.Paged</code>. On top of the paging layer
                 sits a <em>B+-Tree</em> implementation: a balanced tree data structure
                 (which allows storage of (key,value) pairs) specially optimized for disk
                 access.</p>
              <section>
                  <title>3.1. Paged file</title>
                  <p>Paging provides efficient access to a random-access file by allowing
                     parts of the file (<em>pages</em>) to be "mapped" to main memory for
                     easy access. Pages have a fixed length. If data that must be stored is
                     longer than the length of one page, subsequent pages in the file can
                     be "linked" to the first.</p>
                  <figure src="/images/pagedfile.png" alt="Paged file structure"/>
                  <p>As shown in the diagram, a paged file consists of a file header, followed by
                     a list of fixed-length pages. The file header is 4kb long, and each page is,
                     by default, 4kb long. (These values can be modified in the
                     <code>org.apache.core.filer.Page</code> source code).</p>
                  <p>Each page contains a 64-byte header, followed by actual data. Pages are
                     numbered. Whenever a particular page, say page <code>n</code> is needed, but not yet loaded into
                     memory, the Java code can calculate the start address of the page as:</p>
                     <source>offset = fileHeaderSize + (n * pageSize)</source>
                  <p>At this address, it will then find the header of the wanted page, and 64 bytes
                     further, the start of the page's data.</p>
                  <section>
                      <title>3.1.1. Paged file header</title>
                      <figure src="/images/pagedfilehdr.png" alt="File header structure"/>
                      <p>The meaning of the various fields in the file header, whose structure
                         is shown above, is as follows:</p>
                      <ul>
                          <li>header size (2 bytes): set to 4096 (0x1000), the size of this header.</li>
                          <li>page size (4 bytes): set to 4096 (0x00001000), the page size.</li>
                          <li>page count (8 bytes): number of <em>used</em> pages in this data file.</li>
                          <li>total page count (8 bytes): number of <em>used and unused</em> pages in this data file.</li>
                          <li>first free page (8 bytes): page number of the first unused page in this file. (see below)</li>
                          <li>last free page (8 bytes): page number of the last unused page in this file. (see below)</li>
                          <li>page header size (1 byte): size of each page header. Set to 64 (0x40) by default.</li>
                          <li>max key size (2 bytes): see below.</li>
                          <li>record count (8 bytes): number of <em>records</em> stored in this file. (see below)</li>
                      </ul>
                  </section>
                  <section>
                      <title>3.1.2. Page header</title>
                      <figure src="/images/pagehdr.png" alt="Page header structure"/>
                      <p>In addition, each page has its own header, whose structure is shown above.
                         The meaning of the various fields is as follows:</p>
                      <ul><li>status (1 byte): Pages in the data file are either <em>used</em> or
                              <em>unused</em>. <em>Used</em> pages contain actual data.<br/>
                              When a data value (called <em>record</em> in paging terminology) has to be
                                 written to the paged file that is longer than the page size will allow (i.e. longer
                                 than the page size - page header size), a new page is allocated for the part of the value
                                 that doesn't fit in the first page. The header of the first page's next page field is then
                                 set to the page number of this new page. If the data is still too long for both pages,
                                 a third one is allocated and pointed to by the second, and so on.<br/>
                              When a page is unused, its status is set to <code>UNUSED</code>, encoded by the
                                 value 0x00. If a page is used to store the <em>first</em> part of a record, (or only
                                 part if the record is short enough), the status is set to a value that is used by the
                                 B-Tree algorithm. If, on the other hand, the page contains the the <em>remainder</em> of
                                 some record started in some other page, its status is set to <code>OVERFLOW</code>, encoded
                                 by the value 0x7E.</li>
                          <li>key length (2 bytes): pages have the possibility of storing a <em>key</em>
                                 just before their actual data. The maximum length of such a key is set in the
                                 file header's max key size field to 256 (0x0100). The <em>actual size</em> of the
                                 key located in this page, if any, is set in this field. Data for this
                                 page thus begins <code>64 + (key_length)</code> bytes beyond the start of the page.<br/>
                              If no key is used in this page, this field is 0.</li>
                          <li>key hash (4 bytes): As the name suggests, this field stores a 32-bit hash value calculated
                                 from the key. Used to optimize searches mainly.</li>
                          <li>data len (4 bytes): The length of the data stored <em>in this page</em>. If some of the data
                                 of the record stored here continues into a subsequent page, this field contains the length
                                 of the data stored in <em>this page</em> only.</li>
                          <li>record len (4 bytes): the total length of the data record of which part
                                 is stored in this page.</li>
                          <li>next page (8 bytes): page number of the page that contains subsequent
                                 data for the record stored in this page, if more data is available.<br/>
                              If this is the last page that stores data for the record, this field contains
                                 -1 (0xFFFFFFFF).</li>
                      </ul>
                  </section>
  
  
  
              </section>
              <section>
                  <title>3.2. The B+-Tree storage format</title>
              </section>
          </section>
          <section>
              <title>4. XML storage</title>
              <section>
                  <title>4.1. The symbol tables</title>
              </section>
              <section>
                  <title>4.2. The Compressed DOM</title>
              </section>
          </section>
          <section>
              <title>5. Queries</title>
              <section>
                  <title>5.1. XPath Queries</title>
              </section>
              <section>
                  <title>5.2. XUpdate Queries</title>
              </section>
          </section>
          <section>
              <title>6. Indexes</title>
          </section>
          <section>
              <title>7. The XML:DB drivers</title>
              <section>
                  <title>7.1. Embedded driver</title>
              </section>
              <section>
                  <title>7.2. XML-RPC driver</title>
              </section>
          </section>
  
      </body>
  </document>
  
  
  
  1.1                  xml-xindice/src/documentation/resources/images/pagedfile.png
  
  	<<Binary file>>