You are viewing a plain text version of this content. The canonical link for it is here.
Posted to xindice-dev@xml.apache.org by jb...@apache.org on 2002/11/25 17:15:25 UTC
cvs commit: xml-xindice/src/documentation/resources/images pagedfile.png
jbates 2002/11/25 08:15:25
Modified: src/documentation/content/xdocs/dev book.xml
Added: src/documentation/content/xdocs/dev guide-internals.xml
src/documentation/resources/images pagedfile.png
Log:
Started "Xindice internals" documentation
Revision Changes Path
1.4 +5 -2 xml-xindice/src/documentation/content/xdocs/dev/book.xml
Index: book.xml
===================================================================
RCS file: /home/cvs/xml-xindice/src/documentation/content/xdocs/dev/book.xml,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- book.xml 24 Nov 2002 06:48:44 -0000 1.3
+++ book.xml 25 Nov 2002 16:15:25 -0000 1.4
@@ -18,11 +18,14 @@
</menu>
<menu label="Documentation">
- <menu-item label="How to contribute" href="doc-contributing.html"/>
<menu-item label="Administrator Guide" href="guide-administrator.html"/>
<menu-item label="User Guide" href="guide-user.html"/>
<menu-item label="Developer Guide" href="guide-developer.html"/>
<menu-item label="Commandline Tool Guide" href="guide-commandline.html"/>
+ </menu>
+ <menu label="Developer documentation">
+ <menu-item label="How to contribute" href="doc-contributing.html"/>
+ <menu-item label="Xindice internals" href="guide-internals.html"/>
</menu>
</book>
1.1 xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml
Index: guide-internals.xml
===================================================================
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "document-v11.dtd">
<!--
- $Id: guide-internals.xml,v 1.1 2002/11/25 16:15:25 jbates Exp $ $Date: 2002/11/25 16:15:25 $
-->
<document>
<header>
<title>Xindice internals</title>
<authors>
<person id="jb" name="James Bates" email="james@amplexor.com"/>
</authors>
<notice/>
<abstract>
This document describes the internal organization and operation
of the Xindice native XML database engine. It is important reading
for those who intend to contribute to Xindice's core engine.
</abstract>
</header>
<body>
<warning>This documentation is a work in progress and is only applicable to the CVS version of Xindice.
Its content is based mainly on close inspection of existing source code and some
reverse engeneering. Some of the information may be inaccurate, or at least misleading
with respect to the intent of the original core developers (which isn't always obvious
from the source code). You have been warned.
</warning>
<section>
<title>1. Overall Xindice architecture</title>
<p>Xindice is a native XML database engine that is written
entirely in <em>Java</em>. As such it must always be hosted
by a <em>Java Virtual Machine</em> (JVM).</p>
<p>When running, a Xindice instance stores a number of data items in
Java objects inside the JVM, the most important of which are:</p>
<ul>
<li>Object representation of Collection hierarchy</li>
<li>Client connection sate information</li>
<li>Various cached data items</li>
</ul>
<p>In addition, Xindice needs access to disk files containing the XML
data, and related meta-data. The files are stored inside a diskfile
directory hierarchy, that starts somewhere called the <em>database root</em>.
</p>
<section>
<title>1.1. Access modes</title>
<p>Xindice can be set up to run in a JVM in two different ways,
depending on how clients will want to use Xindice.</p>
<p>In <em>embedded</em> mode, a complete Java application will
set up a Xindice instance in <em>its own</em> JVM. Only that one
Java application is able to access and manipulate the data in
Xindice. Clients using the XML:DB API will use something called the
<em>embedded driver</em> to access the Xindice instance that is running
inside the same JVM as the host application.</p>
<p>In <em>server</em> mode, Xindice is run as a standard J2EE <em>web application</em>,
in some <em>web application container</em>, such as <link
href="http://jakarta.apache.org/tomcat">Apache Tomcat</link>. In this
mode, the JVM hosting Xindice, is in fact the JVM running the web application
container. Clients connect to Xindice from different JVM's possibly located
on different machines, using <em>XML-RPC</em>, a <em>Remote Procedure Call</em>
standard designed to work on top of <em>HTTP</em> (which is why Xindice is
packaged as a <em>web application</em> in this mode).</p>
</section>
</section>
<section>
<title>2. Organization of Collections</title>
<p>Logically, all XML data stored in Xindice is organized into a hierarchy
of <em>collections</em>. A collection is exactly what its name suggests:
it contains any number of XML documents, and can in addition contain its
own child collections, thus providing a hierarchy.</p>
<p>The "root" collection is also called the <em>Database</em>. It is
special in that:</p>
<ol>
<li>It has no parent.</li>
<li>It can contain no XML documents of its own. It only has child collections.</li>
</ol>
<p>Each collection in the database is represented in Java by an object of class
<code>org.apache.xindice.core.Collection</code>. As with many Java objects inside
Xindice, it is initialized using an <em>XML configuration description</em>,
a piece of XML describing the properties of the collection. This XML configuration
is modelled in Java as an object of class
<code>org.apache.xindice.util.Configuration</code>. To set up the configuration of
a collection, Xindice calls the collection's <code>setConfig()</code> method, passing it
an appropriately obtained <code>org.apache.xindice.util.Configuration</code> object.</p>
<section>
<title>2.1. The Database</title>
<p>The database, or "root" collection is the Java object
that provides the link to everything else used by the Xindice instance.
When Xindice first starts, its first act is to create and initialize an
object of class <code>org.apache.xindice.core.Database</code>, which extends
<code>org.apache.xindice.core.Collection</code>.</p>
<p>The database object is initialized using an XML configuration file that is obtained
from outside the database (i.e. it is stored simply as a file somewhere). In
the case of an embedded Xindice instance, the file is referenced using the
Java property <code>xindice.configuration</code>, whereas in server mode, the file is
referenced by the parameter <code>xindice-configuration</code> in the Xindice web
application's <code>web.xml</code> file.</p>
<p>The format of the XML configuration file used to initialize the database object
is as follows:</p>
<source><![CDATA[
<xindice>
<root-collection dbroot="./db/" name="db">
<queryengine>
<resolver autoindex="false" class="org.apache.xindice.core.query.XPathQueryResolver" />
<resolver class="org.apache.xindice.core.xupdate.XUpdateQueryResolver" />
</queryengine>
</root-collection>
</xindice>
]]></source>
<p>In fact, if during initialization, the XML configuration file cannot be
found for some reason, rather that throw an error, Xindice will <em>assume</em>
a default configuration file, which is exactly the one shown above.</p>
<p>The important elements in this configuration file are:</p>
<ul>
<li>the <code>dbroot</code> attribute of the <code>root-collection</code>
element. It is a directory path (absolute or relative; if relative,
it is assumed to be relative to the <code>XINDICE_HOME</code>
environment variable), pointing to the <em>database root</em> directory of the
database instance. Recall that this is where Xindice stores all its data and
meta-data files.</li>
<li>the <code>name</code> attribute of the <code>root-collection</code> element.
The root collection (or database) in Xindice is not called simply "/" as
you may have expected. It actually has a name; this name is specified here.</li>
<li>the various <code>resolver</code> elements in the <code>query-engine</code> element. These
point to the <em>query engines</em>, and there Java implementations that this
Xindice instance should support. Out of the box, Xindice supports two query styles,
<em>XPath</em> and <em>XUpdate</em>. They are detailed in a subsequent chapter.</li>
</ul>
<p>Notice that this external configuration file does <em>not</em> specify
the <em>child collections</em> of the database. As we will see shortly,
these are configured in XML somewhere <em>inside</em> the database,
which we can now access (since we know where the database data is
stored).</p>
</section>
<section>
<title>2.2. The database root directory</title>
<p>As indicated, this directory contains all data and meta-data for the
XML content of the database.</p>
<p>The directory structure inside this database root directory reflects
the child collection structure of the database. So if there is a child
collection named <code>mycol</code> in the database, the database root
directory will contain a subdirectory named <code>mycol</code> and so on.</p>
<p>Each collection's directory contains at the minimum a file with extension
<code>.tbl</code> that contains <em>all</em> the XML documents stored in
that collection. The file is <em>not</em> human-readable. Its format
is explained in subsequent chapters.</p>
</section>
<section>
<title>2.3. The System Collection</title>
<p>One special collection, called <code>system</code>, always exists within
a Xindice database. When the Xindice database is initialized,
it automatically also loads the system collection, as this known to always exist.
The structure of the system collection is simple: it contains no documents of its
own, but contains two child collections: <code>SysConfig</code> and <code>SysSymbols</code>.</p>
<p>The <code>SysConfig</code> collection contains exactly one document called <code>database.xml</code>.
<code>SysSymbols</code> contains various documents that are in fact the <em>Symbol tables</em>
used for storage of the element and attribute names of all XML content in the database. The
symbol tables are detailed in a subsequent section.</p>
<p>The <code>database.xml</code> is the XML configuration file that is used to
initialize all other collections in the database. It is located in the database itself,
because it obviously needs to be updated each time collections are added or removed
from the database.</p>
<p>Its structure is as shown below: (you can check your own configuration by issueing
the command-line tool invocation: <code>xindice rd -c /db/system/SysConfig -n database.xml</code>)
</p>
<source><![CDATA[
<database name="db">
<collections>
<collection compressed="true" name="james">
<filer class="org.apache.xindice.core.filer.BTreeFiler" />
<indexes>
<index class="org.apache.xindice.core.indexer.ValueIndexer" name="myidx" pattern="sub" />
</indexes>
<collections>
<collection compressed="true" name="sub">
<filer class="org.apache.xindice.core.filer.BTreeFiler" />
<indexes />
</collection>
</collections>
</collection>
<collection compressed="true" name="james_sub">
<filer class="org.apache.xindice.core.filer.BTreeFiler" />
<indexes />
</collection>
</collections>
</database>
]]></source>
<p>As you can plainly see, it is here that the XML configuration for
all remaining collections is stored. Please note that the
<code>system</code> collection is not mentioned here. The XML configuration
for the <code>system</code> is hard-coded into Xindice as follows:</p>
<source><![CDATA[
<collection name="system">
<!-- No filer for system collection: it contains no doucments itself -->
<collections>
<collection name="SysSymbols" compressed="true">
<filer class="org.apache.xindice.core.filer.BTreeFiler" />
<symbols>
<symbol name="symbols" id="0" />
<symbol name="symbol" id="1" />
<symbol name="name" id="2" />
<symbol name="id" id="3" />
<symbol name="nsuri" id="4" />
</symbols>
</collection>
<collection name="SysConfig" compressed="false">
<filer class="org.apache.xindice.core.filer.BTreeFiler" />
</collection>"
</collections>"
</collection>
]]></source>
<p>The only way to modify this configuration is to change
the Xindice source code and recompile.</p>
</section>
<section>
<title>2.4. Other Collections</title>
<p>The XML Configuration data used to initialize collection
objects in Java (of
class <code>org.apache.xindice.core.Collection</code>) is, as shown in
the examples above, located in an XML document in the system
collection, or, in the case of the system collection itself,
hard-coded into Xindice.</p>
<p>The important aspects of this configuration data are:</p>
<ul>
<li>A collection is represented by a <code>collection</code> configuration
element. The <code>name</code> attribute indicates the collection's name.
The <code>compressed</code> attribute, usually <code>true</code> indicates
how XML data should be encoded by the collection's <em>filer</em>. More
on this in a subsequent chapter.</li>
<li>Each <code>collection</code> element contains a <code>filer</code> element.
This element basically tells the collection the name of a Java class
that it can use to read its own <em>data file</em>. (The file with a
<code>.tbl</code> extension mentioned earlier).
<code>org.apache.xindice.core.filer.BTreeFiler</code> is the most common,
and indeed the standard filer class used in Xindice. If a collection has
no filer (because the <code>filer</code> element is missing, or because the Java
class it points to couldn't be loaded), then it won't be able to store
any XML documents. That's the case for example of the database (root collection),
and the system collection.</li>
<li>A <code>collection</code> element <em>may</em> contain a
<code>collections</code> element, which contains one <code>collection</code>
element per child collection of the collection under discussion.
</li>
</ul>
</section>
</section>
<section>
<title>3. Data storage</title>
<p>The XML data contained in the XML documents of a collection is
stored in one single data file with extension <code>.tbl</code> that
is located in the collection's directory somewhere in the database
root directory. A special Java class called a <em>filer</em> is responsable
for reading and writing XML data to such a data file.</p>
<p>In this chapter and the next, we will examine the most common filer in Xindice,
implemented by the <code>org.apache.xindice.core.filer.BTreeFiler</code> class.
The mechanism is somewhat complex, but it is broken down into a number
of superimposed <em>layers</em>, each layer implementing some abstract
data structure designied to improve overall performance.</p>
<p>In this chapter, we will concern ourselves not with XML storage directly,
but the more abstract process of storage of (key,value) pairs. In essence,
a Xindice file is no more that a performance-oriented format for storing
(key,value) pairs. We will see later on, that this data format is used not
only for the storage of XML data itself, but also for the storage of
<em>indexes.</em></p>
<p>The <code>org.apache.xindice.core.filer.BTreeFiler</code> class breaks
up data storage into two layers. The bottom-most layer provides a
<em>paged file</em> implementation. The code for this layer is located
in <code>org.apache.core.filer.Paged</code>. On top of the paging layer
sits a <em>B+-Tree</em> implementation: a balanced tree data structure
(which allows storage of (key,value) pairs) specially optimized for disk
access.</p>
<section>
<title>3.1. Paged file</title>
<p>Paging provides efficient access to a random-access file by allowing
parts of the file (<em>pages</em>) to be "mapped" to main memory for
easy access. Pages have a fixed length. If data that must be stored is
longer than the length of one page, subsequent pages in the file can
be "linked" to the first.</p>
<figure src="/images/pagedfile.png" alt="Paged file structure"/>
<p>As shown in the diagram, a paged file consists of a file header, followed by
a list of fixed-length pages. The file header is 4kb long, and each page is,
by default, 4kb long. (These values can be modified in the
<code>org.apache.core.filer.Page</code> source code).</p>
<p>Each page contains a 64-byte header, followed by actual data. Pages are
numbered. Whenever a particular page, say page <code>n</code> is needed, but not yet loaded into
memory, the Java code can calculate the start address of the page as:</p>
<source>offset = fileHeaderSize + (n * pageSize)</source>
<p>At this address, it will then find the header of the wanted page, and 64 bytes
further, the start of the page's data.</p>
<section>
<title>3.1.1. Paged file header</title>
<figure src="/images/pagedfilehdr.png" alt="File header structure"/>
<p>The meaning of the various fields in the file header, whose structure
is shown above, is as follows:</p>
<ul>
<li>header size (2 bytes): set to 4096 (0x1000), the size of this header.</li>
<li>page size (4 bytes): set to 4096 (0x00001000), the page size.</li>
<li>page count (8 bytes): number of <em>used</em> pages in this data file.</li>
<li>total page count (8 bytes): number of <em>used and unused</em> pages in this data file.</li>
<li>first free page (8 bytes): page number of the first unused page in this file. (see below)</li>
<li>last free page (8 bytes): page number of the last unused page in this file. (see below)</li>
<li>page header size (1 byte): size of each page header. Set to 64 (0x40) by default.</li>
<li>max key size (2 bytes): see below.</li>
<li>record count (8 bytes): number of <em>records</em> stored in this file. (see below)</li>
</ul>
</section>
<section>
<title>3.1.2. Page header</title>
<figure src="/images/pagehdr.png" alt="Page header structure"/>
<p>In addition, each page has its own header, whose structure is shown above.
The meaning of the various fields is as follows:</p>
<ul><li>status (1 byte): Pages in the data file are either <em>used</em> or
<em>unused</em>. <em>Used</em> pages contain actual data.<br/>
When a data value (called <em>record</em> in paging terminology) has to be
written to the paged file that is longer than the page size will allow (i.e. longer
than the page size - page header size), a new page is allocated for the part of the value
that doesn't fit in the first page. The header of the first page's next page field is then
set to the page number of this new page. If the data is still too long for both pages,
a third one is allocated and pointed to by the second, and so on.<br/>
When a page is unused, its status is set to <code>UNUSED</code>, encoded by the
value 0x00. If a page is used to store the <em>first</em> part of a record, (or only
part if the record is short enough), the status is set to a value that is used by the
B-Tree algorithm. If, on the other hand, the page contains the the <em>remainder</em> of
some record started in some other page, its status is set to <code>OVERFLOW</code>, encoded
by the value 0x7E.</li>
<li>key length (2 bytes): pages have the possibility of storing a <em>key</em>
just before their actual data. The maximum length of such a key is set in the
file header's max key size field to 256 (0x0100). The <em>actual size</em> of the
key located in this page, if any, is set in this field. Data for this
page thus begins <code>64 + (key_length)</code> bytes beyond the start of the page.<br/>
If no key is used in this page, this field is 0.</li>
<li>key hash (4 bytes): As the name suggests, this field stores a 32-bit hash value calculated
from the key. Used to optimize searches mainly.</li>
<li>data len (4 bytes): The length of the data stored <em>in this page</em>. If some of the data
of the record stored here continues into a subsequent page, this field contains the length
of the data stored in <em>this page</em> only.</li>
<li>record len (4 bytes): the total length of the data record of which part
is stored in this page.</li>
<li>next page (8 bytes): page number of the page that contains subsequent
data for the record stored in this page, if more data is available.<br/>
If this is the last page that stores data for the record, this field contains
-1 (0xFFFFFFFF).</li>
</ul>
</section>
</section>
<section>
<title>3.2. The B+-Tree storage format</title>
</section>
</section>
<section>
<title>4. XML storage</title>
<section>
<title>4.1. The symbol tables</title>
</section>
<section>
<title>4.2. The Compressed DOM</title>
</section>
</section>
<section>
<title>5. Queries</title>
<section>
<title>5.1. XPath Queries</title>
</section>
<section>
<title>5.2. XUpdate Queries</title>
</section>
</section>
<section>
<title>6. Indexes</title>
</section>
<section>
<title>7. The XML:DB drivers</title>
<section>
<title>7.1. Embedded driver</title>
</section>
<section>
<title>7.2. XML-RPC driver</title>
</section>
</section>
</body>
</document>
1.1 xml-xindice/src/documentation/resources/images/pagedfile.png
<<Binary file>>