Posted to java-commits@lucene.apache.org by gs...@apache.org on 2006/11/27 01:00:49 UTC
svn commit: r479465 [2/4] - in /lucene/java/trunk: docs/ docs/images/
docs/lucene-sandbox/ docs/styles/ src/site/ src/site/src/
src/site/src/documentation/ src/site/src/documentation/classes/
src/site/src/documentation/conf/ src/site/src/documentation/...
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/demo4.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,160 @@
+<?xml version="1.0"?>
+<document>
+ <header>
+ <title>
+ Apache Lucene - Basic Demo Sources Walkthrough
+ </title>
+ </header>
+<properties>
+<author email="acoliver@apache.org">Andrew C. Oliver</author>
+</properties>
+<body>
+
+<section id="About the Code"><title>About the Code</title>
+<p>
+In this section we walk through the sources behind the basic Lucene Web Application demo: where to
+find them, their parts and their function. This section is intended for Java developers wishing to
+understand how to use Lucene in their applications or for those involved in deploying web
+applications based on Lucene.
+</p>
+</section>
+
+
+<section id="Location of the source (developers/deployers)"><title>Location of the source (developers/deployers)</title>
+<p>
+Relative to the directory created when you extracted Lucene or retrieved it from Subversion, you
+should see a directory called <code>src</code> which in turn contains a directory called
+<code>jsp</code>. This is the root for all of the Lucene web demo.
+</p>
+<p>
+Within this directory you should see <code>index.jsp</code>. Bring this up in vi or your editor of
+choice.
+</p>
+</section>
+
+<section id="index.jsp (developers/deployers)"><title>index.jsp (developers/deployers)</title>
+<p>
+This JSP page is pretty boring by itself. All it does is include a header, display a form and
+include a footer. If you look at the form, it has two fields: <code>query</code> (where you enter
+your search criteria) and <code>maxresults</code> (where you specify the number of results per page).
+The structure of this JSP makes it easy to customize without even editing this particular
+file: you could simply change the header and footer. Let's look at the <code>header.jsp</code>
+(located in the same directory) next.
+</p>
+</section>
+
+<section id="header.jsp (developers/deployers)"><title>header.jsp (developers/deployers)</title>
+<p>
+The header is also very simple by itself. The only thing it does is include the
+<code>configuration.jsp</code> (which you looked at in the last section of this guide) and set the
+title and a brief header. This would be a good place to put your own custom HTML to "pretty" things
+up a bit. We won't cover the footer because all it does is display the footer and close your tags.
+Let's look at the <code>results.jsp</code>, the meat of this application, next.
+</p>
+</section>
+
+<section id="results.jsp (developers)"><title>results.jsp (developers)</title>
+<p>
+Most of the functionality lies in <code>results.jsp</code>. Much of it is for paging the search
+results, which we'll not cover here as it's commented well enough. The first thing in this page is
+the actual imports for the Lucene classes and Lucene demo classes. These classes are loaded from
+the jars included in the <code>WEB-INF/lib</code> directory in the <code>luceneweb.war</code> file.
+</p>
+<p>
+You'll notice that this file includes the same header and footer as <code>index.jsp</code>. From
+there it constructs an <code><a
+href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a></code> with the
+<code>indexLocation</code> that was specified in <code>configuration.jsp</code>. If there is an
+error of any kind in opening the index, it is displayed to the user and the boolean flag
+<code>error</code> is set to tell the rest of the sections of the jsp not to continue.
+</p>
+<p>
+From there, this jsp attempts to get the search criteria, the start index (used for paging) and the
+maximum number of results per page. If the maximum results per page is not set or not valid then it
+and the start index are set to default values. If only the start index is invalid it is set to a
+default value. If no criteria are provided then a servlet error is thrown (it is assumed that
+this is the result of URL tampering or some form of browser malfunction).
+</p>
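+<p>
+The parameter handling described above can be sketched in plain Java. This is a minimal
+illustration, not the code in <code>results.jsp</code>: the class <code>PagingParams</code>, the
+request parameter name <code>startat</code> and the default of 50 results per page are assumptions
+made for this example, and an <code>IllegalArgumentException</code> stands in for the servlet error.
+</p>

```java
import java.util.Map;

/** Hypothetical helper; the JSP performs these checks inline on request parameters. */
final class PagingParams {
    static final int DEFAULT_MAX_RESULTS = 50; // assumed default, not taken from the demo
    static final int DEFAULT_START = 0;

    final String query;
    final int start;
    final int maxResults;

    private PagingParams(String query, int start, int maxResults) {
        this.query = query;
        this.start = start;
        this.maxResults = maxResults;
    }

    static PagingParams from(Map<String, String> params) {
        String query = params.get("query");
        if (query == null) {
            // Stand-in for the servlet error described in the text.
            throw new IllegalArgumentException("no query specified");
        }
        int max = parseOrMinusOne(params.get("maxresults"));
        if (max <= 0) {
            // Page size missing or invalid: reset both the page size and the start index.
            return new PagingParams(query, DEFAULT_START, DEFAULT_MAX_RESULTS);
        }
        int start = parseOrMinusOne(params.get("startat")); // "startat" is an assumed name
        if (start < 0) {
            start = DEFAULT_START; // only the start index was invalid
        }
        return new PagingParams(query, start, max);
    }

    /** Returns the parsed integer, or -1 when the value is missing or not a number. */
    private static int parseOrMinusOne(String s) {
        if (s == null) {
            return -1;
        }
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```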
+<p>
+The jsp moves on to construct a <code><a
+href="api/org/apache/lucene/analysis/standard/StandardAnalyzer.html">StandardAnalyzer</a></code> to
+analyze the search text. This matches the analyzer used during indexing (<code><a
+href="api/org/apache/lucene/demo/IndexHTML.html">IndexHTML</a></code>), which is generally
+recommended. This is passed to the <code><a
+href="api/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a></code> along with the
+criteria to construct a <code><a href="api/org/apache/lucene/search/Query.html">Query</a></code>
+object. You'll also notice the string literal <code>"contents"</code> included. This specifies
+that the search should cover the <code>contents</code> field and not the <code>title</code>,
+<code>url</code> or some other field in the indexed documents. If there is any error in
+constructing a <code><a href="api/org/apache/lucene/search/Query.html">Query</a></code> object an
+error is displayed to the user.
+</p>
+<p>
+In the next section of the jsp the <code><a
+href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a></code> is asked to search
+given the query object. The results are returned in a collection called <code>hits</code>. If the
+length property of the <code>hits</code> collection is 0 (meaning there were no results) then an
+error is displayed to the user and the error flag is set.
+</p>
+<p>
+Finally the jsp iterates through the <code>hits</code> collection, taking the current page into
+account, and displays properties of the <code><a
+href="api/org/apache/lucene/document/Document.html">Document</a></code> objects we talked about in
+the first walkthrough. These objects contain "known" fields specific to their indexer (in this case
+<code><a href="api/org/apache/lucene/demo/IndexHTML.html">IndexHTML</a></code> constructs a document
+with "url", "title" and "contents").
+</p>
+<p>
+Please note that in a real deployment of Lucene, it's best to instantiate <code><a
+href="api/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a></code> and <code><a
+href="api/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a></code> once, and then
+share them across search requests, instead of re-instantiating per search request.
+</p>
+</section>
+
+<section id="More sources (developers)"><title>More sources (developers)</title>
+<p>
+There are additional sources used by the web app that were not specifically covered by either
+walkthrough, for example the HTML parser, the <code><a
+href="api/org/apache/lucene/demo/IndexHTML.html">IndexHTML</a></code> class and the <code><a
+href="api/org/apache/lucene/demo/HTMLDocument.html">HTMLDocument</a></code> class. These are very
+similar to the classes covered in the first example, with properties specific to parsing and
+indexing HTML. This is beyond our scope; however, by now you should feel like you're "getting
+started" with Lucene.
+</p>
+</section>
+
+<section id="Where to go from here? (everyone!)"><title>Where to go from here? (everyone!)</title>
+<p>
+There are a number of things this demo doesn't do or doesn't do quite right. For instance, you may
+have noticed that documents in the root context are unreachable (unless you reconfigure Tomcat to
+support that context or redirect to it). Anywhere the directory doesn't quite match the
+context mapping, you'll have a broken link in your results. Indexing non-local files and other
+such needs are not supported, and there may be security issues with running the
+indexing application from your webapps directory. There are a number of things left for you, the
+developer, to do.
+</p>
+<p>
+In time some of these things may be added to Lucene as features (if you've got a good idea we'd love
+to hear it!), but for now: this is where you begin and the search engine/indexer ends. Lastly, one
+would assume you'd want to follow the above advice and customize the application to look a little
+more fancy than black on white with "Lucene Template" at the top. We'll see you on the Lucene
+Users' or Developers' <a href="mailinglists.html">mailing lists</a>!
+</p>
+</section>
+
+<section id="When to contact the Author"><title>When to contact the Author</title>
+<p>
+Please resist the urge to contact the authors of this document (without bribes of fame and fortune
+attached). First contact the <a href="mailinglists.html">mailing lists</a>, taking care to <a
+href="http://www.catb.org/~esr/faqs/smart-questions.html">Ask Questions The Smart Way</a>.
+Certainly you'll get the most help that way as well. That being said, feedback and modifications
+to this document and samples are ever so greatly appreciated. They are just best sent to the lists
+or <a href="http://wiki.apache.org/jakarta-lucene/HowToContribute">posted as patches</a>, so that
+everyone can share in them. Thanks for understanding!
+</p>
+</section>
+
+</body>
+</document>
+
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/features.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,47 @@
+<?xml version="1.0"?>
+<document>
+<header>
+<title>Apache Lucene - Features</title>
+</header>
+<body>
+
+<section id="Features"><title>Features</title>
+<p>Lucene offers powerful features through a simple API:</p>
+</section>
+
+<section id="Scalable, High-Performance Indexing"><title>Scalable, High-Performance Indexing</title>
+<ul>
+<li>over 20MB/minute on Pentium M 1.5GHz<br/></li>
+<li>small RAM requirements -- only 1MB heap</li>
+<li>incremental indexing as fast as batch indexing</li>
+<li>index size roughly 20-30% the size of text indexed</li>
+</ul>
+</section>
+
+<section id="Powerful, Accurate and Efficient Search Algorithms"><title>Powerful, Accurate and Efficient Search Algorithms</title>
+<ul>
+<li>ranked searching -- best results returned first</li>
+<li>many powerful query types: phrase queries, wildcard queries, proximity
+ queries, range queries and more</li>
+<li>fielded searching (e.g., title, author, contents)</li>
+<li>date-range searching</li>
+<li>sorting by any field</li>
+<li>multiple-index searching with merged results</li>
+<li>allows simultaneous update and searching</li>
+</ul>
+</section>
+
+<section id="Cross-Platform Solution"><title>Cross-Platform Solution</title>
+<ul>
+<li>Available as Open Source software under the
+ <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License</a>
+ which lets you use Lucene in both commercial and Open Source programs</li>
+<li>100%-pure Java</li>
+<li>Implementations <a href="http://wiki.apache.org/jakarta-lucene/LuceneImplementations">in other
+ programming languages available</a> that are index-compatible</li>
+</ul>
+</section>
+
+</body>
+</document>
+
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,1377 @@
+<?xml version="1.0"?>
+
+<document>
+ <header>
+ <title>
+Apache Lucene - Index File Formats
+ </title>
+ </header>
+ <properties>
+
+ <authors>
+ <person email="cutting@apache.org" name="Doug Cutting"/>
+ </authors>
+ </properties>
+
+ <body>
+ <section id="Index File Formats">
+ <title>Index File Formats</title>
+ <p>
+ This document defines the index file formats used
+ in Lucene version 2.0. If you are using a different
+ version of Lucene, please consult the copy of
+ <code>docs/fileformats.html</code> that was distributed
+ with the version you are using.
+ </p>
+
+ <p>
+ Apache Lucene is written in Java, but several
+ efforts are underway to write
+ <a href="http://wiki.apache.org/jakarta-lucene/LuceneImplementations">versions
+ of Lucene in other programming
+ languages</a>. If these versions are to remain compatible with Apache
+ Lucene, then a language-independent definition of the Lucene index
+ format is required. This document thus attempts to provide a
+ complete and independent definition of the Apache Lucene 2.0 file
+ formats.
+ </p>
+
+ <p>
+ As Lucene evolves, this document should evolve.
+ Versions of Lucene in different programming languages should endeavor
+ to agree on file formats, and generate new versions of this document.
+ </p>
+
+ <p>
+ Compatibility notes are provided in this document,
+ describing how file formats have changed from prior versions.
+ </p>
+
+ </section>
+
+ <section id="Definitions">
+ <title>Definitions</title>
+ <p>
+ The fundamental concepts in Lucene are index,
+ document, field and term.
+ </p>
+
+
+ <p>
+ An index contains a sequence of documents.
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ A document is a sequence of fields.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ A field is a named sequence of terms.
+ </p>
+ </li>
+
+ <li>
+ A term is a string.
+ </li>
+ </ul>
+
+ <p>
+ The same string in two different fields is
+ considered a different term. Thus terms are represented as a pair of
+ strings, the first naming the field, and the second naming text
+ within the field.
+ </p>
+
+ <section id="Inverted Indexing">
+ <title>Inverted Indexing</title>
+ <p>
+ The index stores statistics about terms in order
+ to make term-based search more efficient. Lucene's
+ index falls into the family of indexes known as <i>inverted
+ indexes</i>. This is because it can list, for a term, the documents that contain
+ it. This is the inverse of the natural relationship, in which
+ documents list terms.
+ </p>
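+ <p>
+ The idea of listing, for a term, the documents that contain it can be sketched with an
+ in-memory map. This is only an illustration of the concept, not Lucene's implementation:
+ the class <code>TinyInvertedIndex</code>, the single implicit field, the lowercasing and
+ the whitespace tokenization are all assumptions made for this example.
+ </p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory inverted index over a single field. */
final class TinyInvertedIndex {
    // term text -> numbers of the documents containing it, in increasing order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocNumber = 0;

    /** Adds a document and returns the document number assigned to it. */
    int addDocument(String text) {
        int doc = nextDocNumber++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                docs.add(doc);
            }
        }
        return doc;
    }

    /** The inverted lookup: from a term to the documents that contain it. */
    List<Integer> documentsContaining(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null
                ? Collections.<Integer>emptyList()
                : Collections.unmodifiableList(docs);
    }
}
```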
+ </section>
+ <section id="Types of Fields">
+ <title>Types of Fields</title>
+ <p>
+ In Lucene, fields may be <i>stored</i>, in which
+ case their text is stored in the index literally, in a non-inverted
+ manner. Fields that are inverted are called <i>indexed</i>. A field
+ may be both stored and indexed.</p>
+
+ <p>The text of a field may be <i>tokenized</i> into terms to be
+ indexed, or the text of a field may be used literally as a term to be indexed.
+ Most fields are
+ tokenized, but sometimes it is useful for certain identifier fields
+ to be indexed literally.
+ </p>
+ <p>See the <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a> java docs for more information on Fields.</p>
+ </section>
+
+ <section id="Segments">
+ <title>Segments</title>
+ <p>
+ Lucene indexes may be composed of multiple sub-indexes, or
+ <i>segments</i>. Each segment is a fully independent index, which could be searched
+ separately. Indexes evolve by:
+ </p>
+
+ <ol>
+ <li><p>Creating new segments for newly added documents.</p>
+ </li>
+ <li><p>Merging existing segments.</p>
+ </li>
+ </ol>
+
+ <p>
+ Searches may involve multiple segments and/or multiple indexes, each
+ index potentially composed of a set of segments.
+ </p>
+ </section>
+
+ <section id="Document Numbers">
+ <title>Document Numbers</title>
+ <p>
+ Internally, Lucene refers to documents by an integer <i>document
+ number</i>. The first document added to an index is numbered zero, and each
+ subsequent document added gets a number one greater than the previous.
+ </p>
+
+
+ <p>
+ Note that a document's number may change, so caution should be taken
+ when storing these numbers outside of Lucene. In particular, numbers may
+ change in the following situations:
+ </p>
+
+
+ <ul>
+ <li>
+ <p>
+ The
+ numbers stored in each segment are unique only within the segment,
+ and must be converted before they can be used in a larger context.
+ The standard technique is to allocate each segment a range of
+ values, based on the range of numbers used in that segment. To
+ convert a document number from a segment to an external value, the
+ segment's <i>base</i> document
+ number is added. To convert an external value back to a
+ segment-specific value, the segment is identified by the range that
+ the external value is in, and the segment's base value is
+ subtracted. For example, two five-document segments might be
+ combined, so that the first segment has a base value of zero, and
+ the second of five. Document three from the second segment would
+ have an external value of eight.
+ </p>
+ </li>
+ <li>
+ <p>
+ When documents are deleted, gaps are created
+ in the numbering. These are eventually removed as the index evolves
+ through merging. Deleted documents are dropped when segments are
+ merged. A freshly-merged segment thus has no gaps in its numbering.
+ </p>
+ </li>
+ </ul>
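+ <p>
+ The base-value conversion described above can be sketched as follows. The class
+ <code>SegmentBases</code> and its method names are hypothetical and exist only for this
+ example; the sketch assumes segments are numbered from zero in the order they are combined.
+ </p>

```java
/** Hypothetical helper mapping segment-local document numbers to external ones. */
final class SegmentBases {
    private final int[] bases; // bases[i] is the external number of segment i's document 0
    private final int total;   // total number of documents across all segments

    SegmentBases(int... segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        total = next;
    }

    /** Segment-local number to external value: add the segment's base. */
    int toExternal(int segment, int docInSegment) {
        return bases[segment] + docInSegment;
    }

    /** Identifies the segment by the range the external value falls in. */
    int segmentOf(int external) {
        if (external < 0 || external >= total) {
            throw new IndexOutOfBoundsException("no such document: " + external);
        }
        int segment = bases.length - 1;
        while (bases[segment] > external) {
            segment--;
        }
        return segment;
    }

    /** External value back to a segment-local number: subtract the segment's base. */
    int toLocal(int external) {
        return external - bases[segmentOf(external)];
    }
}
```

+ <p>
+ With two five-document segments, <code>new SegmentBases(5, 5).toExternal(1, 3)</code> gives
+ eight, matching the example in the text.
+ </p>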
+
+ </section>
+
+ </section>
+
+ <section id="Overview">
+ <title>Overview</title>
+ <p>
+ Each segment index maintains the following:
+ </p>
+ <ul>
+ <li><p>Field names. This
+ contains the set of field names used in the index.
+
+ </p>
+ </li>
+ <li><p>Stored Field
+ values. This contains, for each document, a list of attribute-value
+ pairs, where the attributes are field names. These are used to
+ store auxiliary information about the document, such as its title,
+ url, or an identifier to access a
+ database. The set of stored fields is what is returned for each hit
+ when searching. This is keyed by document number.
+ </p>
+ </li>
+ <li><p>Term dictionary.
+ A dictionary containing all of the terms used in all of the indexed
+ fields of all of the documents. The dictionary also contains the
+ number of documents which contain the term, and pointers to the
+ term's frequency and proximity data.
+ </p>
+ </li>
+
+ <li><p>Term Frequency
+ data. For each term in the dictionary, the numbers of all the
+ documents that contain that term, and the frequency of the term in
+ each of those documents.
+ </p>
+ </li>
+
+ <li><p>Term Proximity
+ data. For each term in the dictionary, the positions at which the term
+ occurs in each document.
+ </p>
+ </li>
+
+ <li><p>Normalization
+ factors. For each field in each document, a value is stored that is
+ multiplied into the score for hits on that field.
+ </p>
+ </li>
+ <li><p>Term Vectors. For each field in each document, the term vector
+ (sometimes called document vector) may be stored. A term vector consists
+ of term text and term frequency. To add Term Vectors to your index see the
+ <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a> constructors.
+ </p>
+ </li>
+ <li><p>Deleted documents.
+ An optional file indicating which documents are deleted.
+ </p>
+ </li>
+ </ul>
+
+ <p>Details on each of these are provided in subsequent sections.
+ </p>
+ </section>
+
+ <section id="File Naming">
+ <title>File Naming</title>
+ <p>
+ All files belonging to a segment have the same name with varying
+ extensions. The extensions correspond to the different file formats
+ described below. When using the Compound File format (the default in 1.4 and greater), these files are
+ collapsed into a single .cfs file (see below for details).
+ </p>
+
+ <p>
+ Typically, all segments
+ in an index are stored in a single directory, although this is not
+ required.
+ </p>
+
+ </section>
+
+ <section id="Primitive Types">
+ <title>Primitive Types</title>
+ <section id="Byte">
+ <title>Byte</title>
+ <p>
+ The most primitive type
+ is an eight-bit byte. Files are accessed as sequences of bytes. All
+ other data types are defined as sequences
+ of bytes, so file formats are byte-order independent.
+ </p>
+
+ </section>
+
+ <section id="UInt32">
+ <title>UInt32</title>
+ <p>
+ 32-bit unsigned integers are written as four
+ bytes, high-order bytes first.
+ </p>
+ <p>
+ UInt32 --> &lt;Byte&gt;<sup>4</sup>
+ </p>
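+ <p>
+ Writing the high-order bytes first can be sketched as below. The class
+ <code>UInt32Codec</code> is hypothetical; a Java <code>long</code> is used to hold the
+ value because Java has no unsigned 32-bit type.
+ </p>

```java
/** Hypothetical helper for the four-byte, high-order-first layout. */
final class UInt32Codec {
    /** Encodes a value in the range 0..4294967295 as four bytes, high-order byte first. */
    static byte[] write(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a UInt32: " + value);
        }
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    /** Decodes four bytes starting at offset back into an unsigned value. */
    static long read(byte[] b, int offset) {
        return ((b[offset] & 0xFFL) << 24)
                | ((b[offset + 1] & 0xFFL) << 16)
                | ((b[offset + 2] & 0xFFL) << 8)
                | (b[offset + 3] & 0xFFL);
    }
}
```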
+
+ </section>
+
+ <section id="Uint64">
+ <title>UInt64</title>
+ <p>
+ 64-bit unsigned integers are written as eight
+ bytes, high-order bytes first.
+ </p>
+
+ <p>UInt64 --> &lt;Byte&gt;<sup>8</sup>
+ </p>
+
+ </section>
+
+ <section id="VInt">
+ <title>VInt</title>
+ <p>
+ A variable-length format for non-negative integers is
+ defined where the high-order bit of each byte indicates whether more
+ bytes remain to be read. The low-order seven bits are appended as
+ increasingly more significant bits in the resulting integer value.
+ Thus values from zero to 127 may be stored in a single byte, values
+ from 128 to 16,383 may be stored in two bytes, and so on.
+ </p>
+
+ <p><b>VInt Encoding Example</b></p>
+
+ <table width="100%" border="0" cellpadding="4" cellspacing="0">
+ <col width="64*" />
+ <col width="64*" />
+ <col width="64*" />
+ <col width="64*" />
+ <tr valign="TOP">
+ <td width="25%">
+ <p align="RIGHT"><b>Value</b>
+ </p>
+ </td>
+ <td width="25%">
+ <p align="RIGHT"><b>First byte</b>
+ </p>
+ </td>
+ <td width="25%">
+ <p align="RIGHT"><b>Second byte</b>
+ </p>
+ </td>
+ <td width="25%">
+ <p align="RIGHT"><b>Third byte</b>
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="0" sdnum="1033;0;#,##0">
+ <p align="RIGHT">0
+ </p>
+ </td>
+ <td width="25%" sdval="0" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 00000000
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="1" sdnum="1033;0;#,##0">
+ <p align="RIGHT">1
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="2" sdnum="1033;0;#,##0">
+ <p align="RIGHT">2
+ </p>
+ </td>
+ <td width="25%" sdval="10" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 00000010
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td width="25%" valign="TOP">
+ <p align="RIGHT">...
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="127" sdnum="1033;0;#,##0">
+ <p align="RIGHT">127
+ </p>
+ </td>
+ <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 01111111
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="128" sdnum="1033;0;#,##0">
+ <p align="RIGHT">128
+ </p>
+ </td>
+ <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 10000000
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="129" sdnum="1033;0;#,##0">
+ <p align="RIGHT">129
+ </p>
+ </td>
+ <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 10000001
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="130" sdnum="1033;0;#,##0">
+ <p align="RIGHT">130
+ </p>
+ </td>
+ <td width="25%" sdval="10000010" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 10000010
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td width="25%" valign="TOP">
+ <p align="RIGHT">...
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="16383" sdnum="1033;0;#,##0">
+ <p align="RIGHT">16,383
+ </p>
+ </td>
+ <td width="25%" sdval="11111111" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 11111111
+ </p>
+ </td>
+ <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 01111111
+ </p>
+ </td>
+ <td width="25%" sdnum="1033;0;00000000">
+ <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+ 0.01cm"><br/>
+
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="16384" sdnum="1033;0;#,##0">
+ <p align="RIGHT">16,384
+ </p>
+ </td>
+ <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 10000000
+ </p>
+ </td>
+ <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 10000000
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ </tr>
+ <tr valign="BOTTOM">
+ <td width="25%" sdval="16385" sdnum="1033;0;#,##0">
+ <p align="RIGHT">16,385
+ </p>
+ </td>
+ <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ 10000001
+ </p>
+ </td>
+ <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ 10000000
+ </p>
+ </td>
+ <td width="25%" sdval="1" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+ margin-right: 0.01cm">
+ 00000001
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td width="25%" valign="TOP">
+ <p align="RIGHT">...
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+ margin-right: 0.01cm">
+ <br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+ margin-right: 0.01cm">
+ <br/>
+
+ </p>
+ </td>
+ <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+ <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+ margin-right: 0.01cm">
+ <br/>
+
+ </p>
+ </td>
+ </tr>
+ </table>
+
+ <p>
+ This provides compression while still being
+ efficient to decode.
+ </p>
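+ <p>
+ The encoding shown in the table can be sketched as follows. The class
+ <code>VIntCodec</code> is hypothetical and is not the code Lucene uses; it only follows
+ the rule stated above, emitting the low-order seven bits first and setting the
+ high-order bit of every byte except the last.
+ </p>

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical encoder and decoder for the VInt layout described above. */
final class VIntCodec {
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        // While more than seven bits remain, emit seven bits with the continuation bit set.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value); // final byte, high-order bit clear
        return out.toByteArray();
    }

    static int decode(byte[] bytes, int offset) {
        byte b = bytes[offset++];
        int value = b & 0x7F;
        // Each further byte contributes seven increasingly significant bits.
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```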
+
+ </section>
+
+ <section id="Chars">
+ <title>Chars</title>
+ <p>
+ Lucene writes Unicode
+ character sequences using Java's
+ <a href="http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8">"modified
+ UTF-8 encoding"</a>.
+ </p>
+
+
+ </section>
+
+ <section id="String">
+ <title>String</title>
+ <p>
+ Lucene writes strings as a VInt representing the length, followed by
+ the character data.
+ </p>
+
+ <p>
+ String --> VInt, Chars
+ </p>
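+ <p>
+ A minimal sketch of this layout follows. The class <code>LuceneStringSketch</code> is
+ hypothetical. The text above does not say what unit the length is measured in; this
+ sketch assumes it counts Java <code>char</code> values rather than encoded bytes. The
+ character data is written in modified UTF-8, in which U+0000 takes two bytes and no
+ character takes more than three.
+ </p>

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical writer for the String layout: VInt length, then modified UTF-8 chars. */
final class LuceneStringSketch {
    static byte[] writeString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, s.length()); // assumption: length is the number of Java chars
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x01 && c <= 0x7F) {
                out.write(c);                        // one byte
            } else if (c <= 0x7FF) {
                out.write(0xC0 | (c >> 6));          // two bytes, also used for U+0000
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));         // three bytes
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }
}
```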
+
+ </section>
+
+ </section>
+
+ <section id="Per-Index Files">
+ <title>Per-Index Files</title>
+ <p>
+ The files in this section exist one-per-index.
+ </p>
+
+ <section id="Segments File">
+ <title>Segments File</title>
+ <p>
+ The active segments in the index are stored in the
+ segment info file. An index only has
+ a single file in this format, and it is named "segments".
+ This lists each segment by name, and also contains the size of each
+ segment.
+ </p>
+
+ <p>
+ Segments --> Format, Version, NameCounter, SegCount, &lt;SegName, SegSize&gt;<sup>SegCount</sup>
+ </p>
+
+ <p>
+ Format, NameCounter, SegCount, SegSize --> UInt32
+ </p>
+
+ <p>
+ Version --> UInt64
+ </p>
+
+ <p>
+ SegName --> String
+ </p>
+
+ <p>
+ Format is -1 in Lucene 1.4.
+ </p>
+
+ <p>
+ Version counts how often the index has been
+ changed by adding or deleting documents.
+ </p>
+
+ <p>
+ NameCounter is used to generate names for new segment files.
+ </p>
+
+ <p>
+ SegName is the name of the segment, and is used as the file name prefix
+ for all of the files that compose the segment's index.
+ </p>
+
+ <p>
+ SegSize is the number of documents contained in the segment index.
+ </p>
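+ <p>
+ Put together, the layout above can be sketched with <code>java.io.DataOutputStream</code>,
+ whose <code>writeInt</code> and <code>writeLong</code> methods write the high-order bytes
+ first. The class <code>SegmentsFileSketch</code> is hypothetical, and for brevity it
+ assumes segment names contain only ASCII characters, so each character occupies one byte.
+ </p>

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical writer producing the bytes of a "segments" file as laid out above. */
final class SegmentsFileSketch {
    static byte[] write(long version, int nameCounter, String[] segNames, int[] segSizes) {
        if (segNames.length != segSizes.length) {
            throw new IllegalArgumentException("one size is needed per segment name");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(-1);              // Format
            out.writeLong(version);        // Version
            out.writeInt(nameCounter);     // NameCounter
            out.writeInt(segNames.length); // SegCount
            for (int i = 0; i < segNames.length; i++) {
                writeAsciiString(out, segNames[i]); // SegName
                out.writeInt(segSizes[i]);          // SegSize
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return bytes.toByteArray();
    }

    /** Writes a VInt length followed by one byte per character (ASCII names only). */
    private static void writeAsciiString(DataOutputStream out, String s) throws IOException {
        int length = s.length();
        while ((length & ~0x7F) != 0) {
            out.writeByte((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.writeByte(length);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x01 || c > 0x7F) {
                throw new IllegalArgumentException("sketch handles ASCII names only");
            }
            out.writeByte(c);
        }
    }
}
```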
+
+
+ </section>
+
+ <section id="Lock Files">
+ <title>Lock Files</title>
+ <p>
+ Several files are used to indicate that another
+ process is using an index. Note that these files are not
+ stored in the index directory itself, but rather in the
+ system's temporary directory, as indicated in the Java
+ system property "java.io.tmpdir".
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ When a file named "commit.lock"
+ is present, a process is currently re-writing the "segments"
+ file and deleting outdated segment index files, or a process is
+ reading the "segments"
+ file and opening the files of the segments it names. This lock file
+ prevents files from being deleted by another process after a process
+ has read the "segments"
+ file but before it has managed to open all of the files of the
+ segments named therein.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ When a file named "write.lock"
+ is present, a process is currently adding documents to an index, or
+ removing files from that index. This lock file prevents several
+ processes from attempting to modify an index at the same time.
+ </p>
+ </li>
+ </ul>
+ </section>
+
+ <section id="Deletable File">
+ <title>Deletable File</title>
+ <p>
+ A file named "deletable"
+ contains the names of files that are no longer used by the index, but
+ which could not be deleted. This is only used on Win32, where a
+ file may not be deleted while it is still open. On other platforms
+ the file contains only null bytes.
+ </p>
+
+ <p>
+ Deletable --> DeletableCount,
+ <DeletableName><sup>DeletableCount</sup>
+ </p>
+
+ <p>DeletableCount --> UInt32
+ </p>
+ <p>DeletableName -->
+ String
+ </p>
+ </section>
+
+ <section id="Compound Files">
+ <title>Compound Files</title>
+ <p>Starting with Lucene 1.4 the compound file format became the default. This
+ is simply a container for all files described in the next section.</p>
+
+ <p>Compound (.cfs) --> FileCount, <DataOffset, FileName><sup>FileCount</sup>,
+ FileData<sup>FileCount</sup></p>
+
+ <p>FileCount --> VInt</p>
+
+ <p>DataOffset --> Long</p>
+
+ <p>FileName --> String</p>
+
+ <p>FileData --> raw file data</p>
+ <p>The raw file data is the data from the individual files named above.</p>
+
+ </section>
+
+ </section>
+
+ <section id="Per-Segment Files">
+ <title>Per-Segment Files</title>
+ <p>
+ The remaining files are all per-segment, and are
+ thus defined by suffix.
+ </p>
+ <section id="Fields">
+ <title>Fields</title>
+ <p><br/><b>Field Info</b><br/></p>
+
+ <p>
+ Field names are
+ stored in the field info file, with suffix .fnm.
+ </p>
+ <p>
+ FieldInfos
+ (.fnm) --> FieldsCount, <FieldName,
+ FieldBits><sup>FieldsCount</sup>
+ </p>
+
+ <p>
+ FieldsCount --> VInt
+ </p>
+
+ <p>
+ FieldName --> String
+ </p>
+
+ <p>
+ FieldBits --> Byte
+ </p>
+
+ <p>
+ <ul>
+ <li>
+ The low-order bit is one for
+ indexed fields, and zero for non-indexed fields.
+ </li>
+ <li>
+ The second lowest-order
+ bit is one for fields that have term vectors stored, and zero for fields
+ without term vectors.
+ </li>
+ <p><b>Lucene >= 1.9:</b></p>
+ <li> If the third lowest-order bit is set (0x04), term positions are stored with the term vectors. </li>
+ <li> If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors. </li>
+ <li> If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field. </li>
+ </ul>
+ </p>
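+ <p>
+ The FieldBits flags listed above can be tested with simple bit masks. The sketch below is
+ illustrative only; the class, constant and method names are hypothetical and are not taken
+ from the Lucene sources.
+ </p>

```java
class FieldBitsSketch {
    // Mask values follow the list above; the constant names are made up.
    static final int INDEXED = 0x01;
    static final int TERM_VECTORS = 0x02;
    static final int TV_POSITIONS = 0x04; // Lucene >= 1.9
    static final int TV_OFFSETS = 0x08;   // Lucene >= 1.9
    static final int OMIT_NORMS = 0x10;   // Lucene >= 1.9

    // Turns a FieldBits byte into a readable description.
    static String describe(int fieldBits) {
        StringBuilder sb = new StringBuilder();
        sb.append((fieldBits & INDEXED) != 0 ? "indexed" : "not indexed");
        if ((fieldBits & TERM_VECTORS) != 0) sb.append(", term vectors");
        if ((fieldBits & TV_POSITIONS) != 0) sb.append(", positions");
        if ((fieldBits & TV_OFFSETS) != 0) sb.append(", offsets");
        if ((fieldBits & OMIT_NORMS) != 0) sb.append(", norms omitted");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x03));
    }
}
```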
+
+ <p>
+ Fields are numbered by their order in this file. Thus field zero is
+ the
+ first field in the file, field one the next, and so on. Note that,
+ like document numbers, field numbers are segment relative.
+ </p>
+
+ <p><br/><b>Stored Fields</b><br/></p>
+
+ <p>
+ Stored fields are represented by two files:
+ </p>
+
+ <ol>
+ <li>
+ <p>
+ The field index, or .fdx file.
+ </p>
+
+ <p>
+ This contains, for each document, a pointer to
+ its field data, as follows:
+ </p>
+
+ <p>
+ FieldIndex
+ (.fdx) -->
+ <FieldValuesPosition><sup>SegSize</sup>
+ </p>
+ <p>FieldValuesPosition
+ --> UInt64
+ </p>
+ <p>This
+ is used to find the location within the field data file of the
+ fields of a particular document. Because it contains fixed-length
+ data, this file may be easily randomly accessed. The position of
+ document <i>n</i>'s field data is the UInt64 at <i>n*8</i> in
+ this file.
+ </p>
+ </li>
+ <li>
+ <p>
+ The field data, or .fdt file.
+
+ </p>
+
+ <p>
+ This contains the stored fields of each document,
+ as follows:
+ </p>
+
+ <p>
+ FieldData (.fdt) -->
+ <DocFieldData><sup>SegSize</sup>
+ </p>
+ <p>DocFieldData -->
+ FieldCount, <FieldNum, Bits, Value><sup>FieldCount</sup>
+ </p>
+ <p>FieldCount -->
+ VInt
+ </p>
+ <p>FieldNum -->
+ VInt
+ </p>
+
+ <p><b>Lucene <= 1.4:</b></p>
+ <p>Bits -->
+ Byte
+ </p>
+ <p>Value -->
+ String
+ </p>
+ <p>Only the low-order bit of Bits is used. It is one for
+ tokenized fields, and zero for non-tokenized fields.
+ </p>
+ <p><b>Lucene >= 1.9:</b></p>
+ <p>Bits -->
+ Byte
+ </p>
+ <p>
+ <ul>
+ <li>low order bit is one for tokenized fields</li>
+ <li>second bit is one for fields containing binary data</li>
+ <li>third bit is one for fields with compression option enabled
+ (if compression is enabled, the algorithm used is ZLIB)</li>
+ </ul>
+ </p>
+ <p>Value -->
+ String | BinaryValue (depending on Bits)
+ </p>
+ <p>BinaryValue -->
+ ValueSize, <Byte><sup>ValueSize</sup>
+ </p>
+ <p>ValueSize -->
+ VInt
+ </p>
+
+ </li>
+ </ol>
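+ <p>
+ Because every entry in the field index is eight bytes wide, finding a document's stored
+ fields is a single offset computation. The following sketch shows that lookup on an
+ in-memory copy of a .fdx file. It assumes the 64-bit values are written high-order byte
+ first; the class and method names are hypothetical.
+ </p>

```java
import java.nio.ByteBuffer;

class FieldIndexSketch {
    // Returns the offset into the .fdt file at which the stored fields
    // of document docNumber begin: the 64-bit value at byte docNumber*8.
    static long fieldDataPosition(byte[] fdx, int docNumber) {
        return ByteBuffer.wrap(fdx).getLong(docNumber * 8);
    }

    // Builds a made-up .fdx image from a list of positions.
    static byte[] sampleIndex(long... positions) {
        ByteBuffer out = ByteBuffer.allocate(positions.length * 8);
        for (long position : positions) {
            out.putLong(position);
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] fdx = sampleIndex(0L, 17L, 42L);
        System.out.println(fieldDataPosition(fdx, 2));
    }
}
```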
+
+ </section>
+ <section id="Term Dictionary">
+ <title>Term Dictionary</title>
+ <p>
+ The term dictionary is represented as two files:
+ </p>
+ <ol>
+ <li>
+ <p>
+ The term infos, or .tis file.
+ </p>
+
+ <p>
+ TermInfoFile (.tis)-->
+ TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
+ </p>
+ <p>TIVersion -->
+ UInt32
+ </p>
+ <p>TermCount -->
+ UInt64
+ </p>
+ <p>IndexInterval -->
+ UInt32
+ </p>
+ <p>SkipInterval -->
+ UInt32
+ </p>
+ <p>TermInfos -->
+ <TermInfo><sup>TermCount</sup>
+ </p>
+ <p>TermInfo -->
+ <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
+ </p>
+ <p>Term -->
+ <PrefixLength, Suffix, FieldNum>
+ </p>
+ <p>Suffix -->
+ String
+ </p>
+ <p>PrefixLength,
+ DocFreq, FreqDelta, ProxDelta, SkipDelta<br/> --> VInt
+ </p>
+ <p>This
+ file is sorted by Term. Terms are ordered first lexicographically
+ by the term's field name, and within that lexicographically by the
+ term's text.
+ </p>
+ <p>TIVersion names the version of the format
+ of this file and is -2 in Lucene 1.4.
+ </p>
+ <p>Term
+ text prefixes are shared. The PrefixLength is the number of initial
+ characters from the previous term which must be pre-pended to a
+ term's suffix in order to form the term's text. Thus, if the
+ previous term's text was "bone" and the term is "boy",
+ the PrefixLength is two and the suffix is "y".
+ </p>
+ <p>FieldNum
+ determines the term's field, whose name is stored in the .fnm file.
+ </p>
+ <p>DocFreq
+ is the count of documents which contain the term.
+ </p>
+ <p>FreqDelta
+ determines the position of this term's TermFreqs within the .frq
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file).
+ </p>
+ <p>ProxDelta
+ determines the position of this term's TermPositions within the .prx
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file).
+ </p>
+ <p>SkipDelta determines the position of this
+ term's SkipData within the .frq file. In
+ particular, it is the number of bytes
+ after TermFreqs that the SkipData starts.
+ In other words, it is the length of the
+ TermFreqs data.
+ </p>
+ </li>
+ <li>
+ <p>
+ The term info index, or .tii file.
+ </p>
+
+ <p>
+ This contains every IndexInterval<sup>th</sup> entry from the .tis
+ file, along with its location in the .tis file. This is
+ designed to be read entirely into memory and used to provide random
+ access to the .tis file.
+ </p>
+
+ <p>
+ The structure of this file is very similar to the
+ .tis file, with the addition of one item per record, the IndexDelta.
+ </p>
+
+ <p>
+ TermInfoIndex (.tii)-->
+ TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
+ </p>
+ <p>TIVersion -->
+ UInt32
+ </p>
+ <p>IndexTermCount -->
+ UInt64
+ </p>
+ <p>IndexInterval -->
+ UInt32
+ </p>
+ <p>SkipInterval -->
+ UInt32
+ </p>
+ <p>TermIndices -->
+ <TermInfo, IndexDelta><sup>IndexTermCount</sup>
+ </p>
+ <p>IndexDelta -->
+ VLong
+ </p>
+ <p>IndexDelta
+ determines the position of this term's TermInfo within the .tis file. In
+ particular, it is the difference between the position of this term's
+ entry in that file and the position of the previous term's entry.
+ </p>
+ <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
+ Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
+ smaller values result in bigger indexes, less acceleration and more
+ accelerable cases.</p>
+ </li>
+ </ol>
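+ <p>
+ The shared-prefix encoding of term text described above can be sketched as follows. This
+ is an illustration of the idea only, with hypothetical class and method names; it works
+ on Java strings and leaves out the surrounding file I/O.
+ </p>

```java
class PrefixSharingSketch {
    // Number of leading characters that current shares with previous.
    static int sharedPrefixLength(String previous, String current) {
        int max = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < max && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // The part of current that has to be written after the PrefixLength.
    static String suffix(String previous, String current) {
        return current.substring(sharedPrefixLength(previous, current));
    }

    // Rebuilds a term's text from the previous term, PrefixLength and Suffix.
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }

    public static void main(String[] args) {
        int prefixLength = sharedPrefixLength("bone", "boy");
        String suffix = suffix("bone", "boy");
        System.out.println(prefixLength + " " + suffix + " " + rebuild("bone", prefixLength, suffix));
    }
}
```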
+ </section>
+
+ <section id="Frequencies">
+ <title>Frequencies</title>
+ <p>
+ The .frq file contains the lists of documents
+ which contain each term, along with the frequency of the term in that
+ document.
+ </p>
+ <p>FreqFile (.frq) -->
+ <TermFreqs, SkipData><sup>TermCount</sup>
+ </p>
+ <p>TermFreqs -->
+ <TermFreq><sup>DocFreq</sup>
+ </p>
+ <p>TermFreq -->
+ DocDelta, Freq?
+ </p>
+ <p>SkipData -->
+ <SkipDatum><sup>DocFreq/SkipInterval</sup>
+ </p>
+ <p>SkipDatum -->
+ DocSkip,FreqSkip,ProxSkip
+ </p>
+ <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip -->
+ VInt
+ </p>
+ <p>TermFreqs
+ are ordered by term (the term is implicit, from the .tis file).
+ </p>
+ <p>TermFreq
+ entries are ordered by increasing document number.
+ </p>
+ <p>DocDelta
+ determines both the document number and the frequency. In
+ particular, DocDelta/2 is the difference between this document number
+ and the previous document number (or zero when this is the first
+ document in a TermFreqs). When DocDelta is odd, the frequency is
+ one. When DocDelta is even, the frequency is read as another VInt.
+ </p>
+ <p>For
+ example, the TermFreqs for a term which occurs once in document seven
+ and three times in document eleven would be the following sequence of
+ VInts:
+ </p>
+ <p> 15,
+ 8, 3
+ </p>
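+ <p>
+ The DocDelta scheme can be checked against the example above with a short sketch. The
+ class and method names are hypothetical, and the VInts are kept as plain integers in a
+ list rather than being written out as bytes.
+ </p>

```java
import java.util.ArrayList;
import java.util.List;

class TermFreqsSketch {
    // Encodes parallel arrays of increasing document numbers and their
    // frequencies into the sequence of VInt values described above.
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - previousDoc;
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1); // odd DocDelta: frequency is one
            } else {
                out.add(delta * 2);     // even DocDelta: frequency follows
                out.add(freqs[i]);
            }
            previousDoc = docs[i];
        }
        return out;
    }

    // Decodes the sequence back into {document number, frequency} pairs.
    static List<int[]> decode(List<Integer> vints) {
        List<int[]> out = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.size()) {
            int docDelta = vints.get(i++);
            doc += docDelta >>> 1;
            int freq = (docDelta & 1) != 0 ? 1 : vints.get(i++);
            out.add(new int[] {doc, freq});
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[] {7, 11}, new int[] {1, 3}));
    }
}
```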
+ <p>DocSkip records the document number before every
+ SkipInterval<sup>th</sup> document in TermFreqs.
+ Document numbers are represented as differences
+ from the previous value in the sequence. FreqSkip
+ and ProxSkip record the position of every
+ SkipInterval<sup>th</sup> entry in FreqFile and
+ ProxFile, respectively. For the first SkipDatum, file positions are
+ relative to the start of TermFreqs and Positions; for each later
+ SkipDatum, they are relative to the previous SkipDatum in the sequence.
+ </p>
+ <p>For example, if DocFreq=35 and SkipInterval=16,
+ then there are two SkipDatum entries, containing
+ the 15<sup>th</sup> and 31<sup>st</sup> document
+ numbers in TermFreqs. The first FreqSkip names
+ the number of bytes after the beginning of
+ TermFreqs at which the 16<sup>th</sup> TermFreq
+ starts, and the second the number of bytes after
+ that at which the 32<sup>nd</sup> starts. The first
+ ProxSkip names the number of bytes after the
+ beginning of Positions at which the 16<sup>th</sup>
+ Positions entry starts, and the second the number of
+ bytes after that at which the 32<sup>nd</sup> starts.
+ </p>
+
+ </section>
+ <section id="Positions">
+ <title>Positions</title>
+ <p>
+ The .prx file contains the lists of positions that
+ each term occurs at within documents.
+ </p>
+ <p>ProxFile (.prx) -->
+ <TermPositions><sup>TermCount</sup>
+ </p>
+ <p>TermPositions -->
+ <Positions><sup>DocFreq</sup>
+ </p>
+ <p>Positions -->
+ <PositionDelta><sup>Freq</sup>
+ </p>
+ <p>PositionDelta -->
+ VInt
+ </p>
+ <p>TermPositions
+ are ordered by term (the term is implicit, from the .tis file).
+ </p>
+ <p>Positions
+ entries are ordered by increasing document number (the document
+ number is implicit from the .frq file).
+ </p>
+ <p>PositionDelta
+ is the difference between the position of the current occurrence in
+ the document and the previous occurrence (or zero, if this is the
+ first occurrence in this document).
+ </p>
+ <p>
+ For example, the TermPositions for a
+ term which occurs as the fourth term in one document, and as the
+ fifth and ninth term in a subsequent document, would be the following
+ sequence of VInts:
+ </p>
+ <p> 4,
+ 5, 4
+ </p>
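+ <p>
+ The position deltas in the example above can be reproduced with a small sketch. The names
+ are hypothetical; each inner array holds the positions of the term within one document,
+ in increasing order, and the VInts are kept as plain integers.
+ </p>

```java
import java.util.ArrayList;
import java.util.List;

class PositionsSketch {
    // Encodes, document by document, each position as the difference
    // from the previous position in the same document (or from zero).
    static List<Integer> encode(int[][] positionsPerDocument) {
        List<Integer> out = new ArrayList<>();
        for (int[] positions : positionsPerDocument) {
            int previous = 0; // the delta restarts in every document
            for (int position : positions) {
                out.add(position - previous);
                previous = position;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[][] {{4}, {5, 9}}));
    }
}
```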
+ </section>
+ <section id="Normalization Factors">
+ <title>Normalization Factors</title>
+ <p>There's a norm file for each indexed field with a byte for
+ each document. The .f[0-9]* file contains,
+ for each document, a byte that encodes a value that is multiplied
+ into the score for hits on that field:
+ </p>
+ <p>Norms
+ (.f[0-9]*) --> <Byte><sup>SegSize</sup>
+ </p>
+ <p>Each
+ byte encodes a floating point value. Bits 0-2 contain the 3-bit
+ mantissa, and bits 3-7 contain the 5-bit exponent.
+ </p>
+ <p>These
+ are converted to an IEEE single float value as follows:
+ </p>
+ <ol>
+ <li><p>If
+ the byte is zero, use a zero float.
+ </p>
+ </li>
+ <li><p>Otherwise,
+ set the sign bit of the float to zero;
+ </p>
+ </li>
+ <li><p>add
+ 48 to the exponent and use this as the float's exponent;
+ </p>
+ </li>
+ <li><p>map
+ the mantissa to the high-order 3 bits of the float's mantissa; and
+
+ </p>
+ </li>
+ <li><p>set
+ the low-order 21 bits of the float's mantissa to zero.
+ </p>
+ </li>
+ </ol>
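+ <p>
+ One possible reading of the conversion steps above is sketched below. The class and method
+ names are hypothetical, and the shift amounts (24 for the biased exponent, 21 for the
+ mantissa) are an assumption on our part: they leave the low-order 21 bits of the result at
+ zero, as the last step requires, and they make the byte 0x7C decode to 1.0. Under this
+ reading the top mantissa bit shares a position with the lowest bit of the float's exponent
+ field, so treat the code as an illustration rather than a specification.
+ </p>

```java
class NormDecodeSketch {
    // Decodes a norm byte: bits 0-2 are the mantissa, bits 3-7 the exponent.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;                  // zero is a special case
        }
        int mantissa = b & 0x07;          // low three bits
        int exponent = (b >> 3) & 0x1F;   // high five bits
        // Assumed bit placement; the sign bit stays zero.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode((byte) 0x7C));
    }
}
```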
+
+ </section>
+ <section id="Term Vectors">
+ <title>Term Vectors</title>
+ <p>Term Vector support is optional on a field-by-field basis. It consists of the
+ three files described below.</p>
+ <ol>
+ <li>
+ <p>The Document Index or .tvx file.</p>
+ <p>This contains, for each document, a pointer to the document data in the Document
+ (.tvd) file.
+ </p>
+ <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition><sup>NumDocs</sup></p>
+ <p>TVXVersion --> Int</p>
+ <p>DocumentPosition --> UInt64</p>
+ <p>This is used to find the position of the Document in the .tvd file.</p>
+ </li>
+ <li>
+ <p>The Document or .tvd file.</p>
+ <p>This contains, for each document, the number of fields, a list of the fields with
+ term vector info and finally a list of pointers to the field information in the .tvf
+ (Term Vector Fields) file.</p>
+ <p>
+ Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions><sup>NumDocs</sup>
+ </p>
+ <p>TVDVersion --> Int</p>
+ <p>NumFields --> VInt</p>
+ <p>FieldNums --> <FieldNumDelta><sup>NumFields</sup></p>
+ <p>FieldNumDelta --> VInt</p>
+ <p>FieldPositions --> <FieldPosition><sup>NumFields</sup></p>
+ <p>FieldPosition --> VLong</p>
+ <p>The .tvd file is used to map out the fields that have term vectors stored and
+ where the field information is in the .tvf file.</p>
+ </li>
+ <li>
+ <p>The Field or .tvf file.</p>
+ <p>This file contains, for each field that has a term vector stored, a list of
+ the terms and their frequencies.</p>
+ <p>Field (.tvf) --> TVFVersion<NumTerms, NumDistinct, TermFreqs><sup>NumFields</sup></p>
+ <p>TVFVersion --> Int</p>
+ <p>NumTerms --> VInt</p>
+ <p>NumDistinct --> VInt -- Future Use</p>
+ <p>TermFreqs --> <TermText, TermFreq><sup>NumTerms</sup></p>
+ <p>TermText --> <PrefixLength, Suffix></p>
+ <p>PrefixLength --> VInt</p>
+ <p>Suffix --> String</p>
+ <p>TermFreq --> VInt</p>
+ <p>Term
+ text prefixes are shared. The PrefixLength is the number of initial
+ characters from the previous term which must be pre-pended to a
+ term's suffix in order to form the term's text. Thus, if the
+ previous term's text was "bone" and the term is "boy",
+ the PrefixLength is two and the suffix is "y".
+ </p>
+ </li>
+ </ol>
+ </section>
+
+ <section id="Deleted Documents">
+ <title>Deleted Documents</title>
+
+ <p>The .del file is
+ optional, and only exists when a segment contains deletions:
+ </p>
+
+ <p>Deletions
+ (.del) --> ByteCount,BitCount,Bits
+ </p>
+
+ <p>ByteCount,BitCount -->
+ UInt32
+ </p>
+
+ <p>Bits -->
+ <Byte><sup>ByteCount</sup>
+ </p>
+
+ <p>ByteCount
+ indicates the number of bytes in Bits. It is typically
+ (SegSize/8)+1.
+ </p>
+
+ <p>
+ BitCount
+ indicates the number of bits that are currently set in Bits.
+ </p>
+
+ <p>Bits
+ contains one bit for each document indexed. When the bit
+ corresponding to a document number is set, that document is marked as
+ deleted. Bit ordering is from least to most significant. Thus, if
+ Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
+ deleted.
+ </p>
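+ <p>
+ The bit ordering can be illustrated with a short sketch that tests and counts bits in an
+ in-memory copy of Bits. The class and method names are hypothetical.
+ </p>

```java
class DeletedDocsSketch {
    // True when the bit for docNumber is set. Within each byte, bits
    // are numbered from least significant (0) to most significant (7).
    static boolean isDeleted(byte[] bits, int docNumber) {
        return (bits[docNumber >> 3] & (1 << (docNumber & 7))) != 0;
    }

    // Number of set bits, which is the value BitCount would hold.
    static int countDeleted(byte[] bits) {
        int count = 0;
        for (byte b : bits) {
            count += Integer.bitCount(b & 0xFF);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bits = {0x00, 0x02};
        System.out.println(isDeleted(bits, 9) + " " + countDeleted(bits));
    }
}
```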
+ </section>
+ </section>
+
+ <section id="Limitations">
+ <title>Limitations</title>
+ <p>There
+ are a few places where these file formats limit the maximum number of
+ terms and documents to a 32-bit quantity, or to approximately 4
+ billion. This is not a problem today, but in the long term it
+ probably will be. These should therefore be replaced with either
+ UInt64 values, or better yet, with VInt values, which have no limit.
+ </p>
+
+ </section>
+
+ </body>
+
+</document>
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml?view=auto&rev=479465
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml (added)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/gettingstarted.xml Sun Nov 26 16:00:46 2006
@@ -0,0 +1,55 @@
+<?xml version="1.0"?>
+<document>
+ <header>
+ <title>
+ Apache Lucene - Getting Started Guide
+ </title>
+ </header>
+<properties>
+<author email="acoliver@apache.org">Andrew C. Oliver</author>
+</properties>
+<body>
+
+<section id="Getting Started">
+ <title>Getting Started</title>
+<p>
+This document is intended as a "getting started" guide. It has three audiences: first-time users
+looking to install Apache Lucene in their application or web server; developers looking to modify or base
+the applications they develop on Lucene; and developers looking to become involved in and contribute
+to the development of Lucene. This document is written in tutorial and walk-through format. The
+goal is to help you "get started". It does not go into great depth on some of the conceptual or
+inner details of Lucene.
+</p>
+
+<p>
+The sections listed below build on one another. More advanced users
+may wish to skip sections.
+</p>
+
+<ul>
+ <li><a href="demo.html">About the command-line Lucene demo and its usage</a>. This section
+ is intended for anyone who wants to use the command-line Lucene demo.</li> <p/>
+
+ <li><a href="demo2.html">About the sources and implementation for the command-line Lucene
+ demo</a>. This section walks through the implementation details (sources) of the
+ command-line Lucene demo. This section is intended for developers.</li> <p/>
+
+ <li><a href="demo3.html">About installing and configuring the demo template web
+ application</a>. While this walk-through assumes Tomcat as your container of choice,
+ there is no reason you can't (provided you have the requisite knowledge) adapt the
+ instructions to your container. This section is intended for those responsible for the
+ development or deployment of Lucene-based web applications.</li> <p/>
+
+ <li><a href="demo4.html">About the sources used to construct the demo template web
+ application</a>. Please note the template application is designed to highlight features of
+ Lucene and is <b>not</b> an example of best practices. (One would hopefully use MVC
+ architecture such as provided by Jakarta Struts and taglibs, but showing you how to do that
+ would be WAY beyond the scope of this guide.) This section is intended for developers and
+ those wishing to customize the demo template web application to their needs. </li>
+
+</ul>
+</section>
+
+</body>
+</document>
+
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/asf-logo.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico
------------------------------------------------------------------------------
svn:executable = *
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/favicon.ico
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_architecture.jpg
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/larm_crawling-process.jpg
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lia_3d.jpg
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_100.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_150.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_200.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_250.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_green_300.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_100.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_150.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_200.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_250.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif?view=auto&rev=479465
==============================================================================
Binary file - no diff available.
Propchange: lucene/java/trunk/src/site/src/documentation/content/xdocs/images/lucene_outline_300.gif
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream