Posted to java-user@lucene.apache.org by Edward Summers <eh...@pobox.com> on 2005/09/20 04:59:31 UTC

large index sizes

I'm investigating possible alternatives for indexing/searching a very
large dataset (2TB) of XML data from the PubMed database [1]. Does
anyone have any experience working with indexes of this size? Granted,
the actual index size would be smaller than the source files, but I'm
just curious how big the largest known Lucene indexes are, and what
sort of hardware they run on...assuming they're not behind closed
doors at the Dept of Homeland Security ;-)

//Ed

[1] http://www.ncbi.nlm.nih.gov/entrez

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: large index sizes

Posted by Richard Littin <ri...@reeltwo.com>.

Hi Edward,

We have indexed the MedLine data.  We used the default StopAnalyzer on 
the full text fields (fields that are more than just dates or ids) and 
the default Keyword for the other fields.  So the short fields are 
stored in the index, while the larger fields are indexed only.  In our 
application we store the MedLine text in a MySQL database and link 
between the Lucene index and database via a synthetic key generated by 
the database and stored in the Lucene index.  The sizes of things on 
disk are:

lucene index: 6.9GB
mysql db: 30.0 GB

The original compressed XML data is about 7.5GB.  We also annotate the 
MedLine data and store that in the Lucene index and the database.  The 
annotations would be less than 20% of the index.  Also, I'm not 
sure that we index/store all the information available in the MedLine XML.

We run a search web service that uses the Lucene index for searching and 
the database for data retrieval.  This runs on a fairly standard 3.0GHz 
P4 with 2GB memory (it will easily run in 1GB of memory but you go with 
what you've got) and has about 1TB of disk (RAIDed across four 400GB 
disks).  The machine is a Linux box, RedHat Fedora Core 3, and the web 
services are served up via Jakarta Tomcat (all code is Java).  A similarly 
spec'd machine is used to build the index and database, although we tend 
to use 64-bit AMDs as they allow access to more memory (bigger Lucene 
RAMDirectory instances).  It takes about 2 days to build the database and 
Lucene index.

We have another dataset which has a 27GB Lucene index and well over 
100GB in the database.  The only issue we have had with Lucene is the 
speed at which you can access data from these large indexes.  By this I 
mean obtaining the synthetic key that links the Lucene index to the 
database.  We got around this issue by creating a separate array of 
synthetic keys (they are ints) that is indexed by Lucene document id.  
This array is persisted as a file on disk and memory-mapped via 
java.nio.
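
For anyone wanting to try the same trick, here is a rough sketch of what 
such a key array could look like.  This is not Richard's actual code: the 
class name DocKeyMap, its method names, and the file layout (one 
big-endian 4-byte int per document id) are all assumptions made up for 
illustration.  Only standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: looks up a database key by Lucene document id. */
final class DocKeyMap {
    private final IntBuffer keys;

    private DocKeyMap(IntBuffer keys) {
        this.keys = keys;
    }

    /** Writes keys[docId] to the file as consecutive big-endian ints. */
    static void write(Path file, int[] keys) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(keys.length * Integer.BYTES);
        // The int view shares storage with buf; buf's own position stays at 0.
        buf.asIntBuffer().put(keys);
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    /** Memory-maps the file read-only; the OS pages entries in on demand. */
    static DocKeyMap open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            IntBuffer view = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                               .asIntBuffer();
            return new DocKeyMap(view);
        }
    }

    /** Returns the synthetic database key for the given document id. */
    int keyFor(int docId) {
        return keys.get(docId);
    }

    /** Number of document ids covered by the mapped file. */
    int size() {
        return keys.limit();
    }
}
```

The array would be written once after the index is built, e.g. 
DocKeyMap.write(path, keys), and then opened by the search service with 
DocKeyMap.open(path), so that each hit's document id can be turned into a 
database key via keyFor(docId) without loading the stored field.  Two 
caveats: Lucene document ids can change when segments are merged, so the 
file presumably has to be regenerated whenever the index is rebuilt, and 
a single mapping is limited to 2GB, which is still room for roughly 500 
million int keys.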

Hope this is of some help.

Richard


-- 
Richard Littin, Reel Two NZ, PO Box 1538, Hamilton, New Zealand
richard@reeltwo.com - ph: +64 7 857 0703 - fax: +64 7 857 0701
void main(){float x=0;while(putchar((long)((x<4)?(((6.6667*x-34.5)
*x+50.8333)*x+++82):(((-7.5*x+97)*x-398.5)*x+++619)))-10);}

