You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Glen Newton <gl...@gmail.com> on 2008/04/15 18:25:23 UTC

Lucene performance: benchmarktemplate.xml

<benchmark>
  <ul>
  <p>
  <b>Hardware Environment</b><br/>
  <li><i>Dedicated machine for indexing</i>: yes</li>
  <li><i>CPU</i>: Dual processor dual core Xeon CPU 3.00GHz;
hyperthreading ON for 8 virtual cores</li>
  <li><i>RAM</i>: 8GB</li>
  <li><i>Drive configuration</i>: Dell EMC AX150 storage array fibre
channel</li>
  </p>
  <p>
  <b>Software environment</b><br/>
  <li><i>Lucene Version</i>: 2.3.1</li>
  <li><i>Java Version</i>:  Java(TM) SE Runtime Environment (build
1.6.0_02-b05)</li>
  <li><i>Java VM</i>: Java HotSpot(TM) 64-Bit Server VM (build
1.6.0_02-b05, mixed mode)</li>
  <li><i>OS Version</i>: Linux OpenSUSE 10.2 (64-bit X86-64)</li>
  <li><i>Location of index</i>: Filesystem, on attached storage</li>
  </p>
  <p>
  <b>Lucene indexing variables</b><br/>
  <li><i>Number of source documents</i>: 6,404,464</li>
  <li><i>Total filesize of source documents</i>: 141GB; Note that this
is only the full-text: the metadata (title, author(s), abstract,
keywords, journal name) are in addition to this</li>
  <li><i>Average filesize of source documents</i>:
22KB + metadata (see above)</li>
  <li><i>Source documents storage location</i>: Where are the documents
being indexed located?
    Filesystem</li>
  <li><i>File type of source documents</i>: text (PDFs converted to
text then gzipped)</li>
  <li><i>Parser(s) used, if any</i>: None, but files GZIPed & had to
be un-gziped by Java application which also did indexing</li>
  <li><i>Analyzer(s) used</i>: StandardAnalyzer</li>
  <li><i>Number of fields per document</i>: 24</li>
  <li><i>Type of fields</i>: all text; 20 stored; 3 of indexed
tokenized with term vector (full-text [not stored], title, abstract);
10 stored with no parsing; </li>
  <li><i>Index persistence</i>: FSDirectory</li>
  <li><i>Index size</i>: 83GB</li>
  <li><i>Number of terms</i>: 143,298,010</li>
  </p>
  <p>
  <b>Figures</b><br/>
  <li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: 20.5 hours</li>
  <li><i>Time taken / 1000 docs indexed</i>: 11.5 seconds </li>
  <li><i>Memory consumption</i>:  -Xms4000m  -Xmx6000m</li>
  <li><i>Query speed</i>: average time a query takes, type
    of queries (e.g. simple one-term query, phrase query),
    not measuring any overhead outside Lucene</li>
  </p>
  <p>
  <b>Notes</b><br/>
  <li><i>Notes</i>:
        <ul>
	  <li>
	    These are journal articles, so the additional fields besides the
full-text are bibliographic metadata, such as title, authors,
abstract, keywords, journal name, volume, issue, start page, year.
	  </li>
	  <li>Java command line directives: -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:-UseParallelGC   -server  -Xms4000m
-Xmx6000m
	  </li>
	  <li>Highly multithreaded & pipelined architecture using
java.util.concurrent.ThreadPoolExecutor
	  </li>
	  <li>File system file reading and Un-gzip performed multithreaded
	  </li>
	  <li>Eight separate parallel IndexWriters are fed by the pipeline
(creation of Document objects occurs in parallel with 64 threads),
merged at end into single index. Each parallel index had slightly
different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB
respectively), so that flushing wouldn't all happen at the same time.
	  </li>
	  <li>
	    Contact: glen DOT newton AT nrc-cnrc DOT gc DOT ca
	  </li>
	
	</ul>
</li>
  </p>
  </ul>
</benchmark>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org