Posted to java-user@lucene.apache.org by Edward Summers <eh...@pobox.com> on 2005/09/20 04:59:31 UTC
large index sizes
I'm investigating possible alternatives for indexing/searching a very
large dataset (2TB) of XML data from the PubMed database[1]. Does
anyone have any experience working with indexes of this size? Granted,
the actual index size would be smaller than the source files, but I'm
just curious how big the largest known Lucene indexes are, and what
sort of hardware they run on...assuming they're not behind closed
doors at the Dept of Homeland Security ;-)
//Ed
[1] http://www.ncbi.nlm.nih.gov/entrez
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: large index sizes
Posted by Richard Littin <ri...@reeltwo.com>.
Hi Edward,
We have indexed the MedLine data. We used the default StopAnalyzer on
the full text fields (fields that are more than just dates or ids) and
the default Keyword for the other fields. So the index stores the short
fields and only indexes the larger ones (their text is not stored). In our
application we store the MedLine text in a MySQL database and link
between the Lucene index and database via a synthetic key generated by
the database and stored in the Lucene index. The sizes of things on
disk are:
Lucene index: 6.9GB
MySQL db:     30.0GB
The original compressed XML data is about 7.5GB. We also annotate the
MedLine data and store that in the Lucene index and the database. The
annotations would be less than 20% of the index. Also, I'm not
sure that we index/store all the information available in the MedLine XML.
We run a search web service that uses the Lucene index for searching and
the database for data retrieval. This runs on a fairly standard 3.0GHz
P4 with 2GB memory (it will easily run in 1GB of memory, but you go with
what you've got) and has about 1TB of disk (RAIDed across four 400GB
disks). The machine is a Linux box, RedHat Fedora Core 3, and the web
services are served up via Jakarta Tomcat (all code is Java). A similarly
spec'd machine is used to build the index and database, although we tend
to use 64-bit AMDs as they allow access to more memory (bigger Lucene
RAMDirectory instances). It takes about 2 days to build the database and
Lucene index.
We have another dataset which has a 27GB Lucene index and well over
100GB in the database. The only issue we have had with Lucene is the
speed at which you can access data from these large indexes. By this I
mean obtaining the synthetic key that links the Lucene index to the
database. We got around this issue by creating a separate array of
synthetic keys (they are ints) that is indexed by Lucene document id.
This array is persisted as a file on disk and memory-mapped using
java.nio.
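
For anyone wanting to try the same trick, here is a rough sketch of what
such a doc-id-to-key array could look like. This is not our actual code;
the class name DocIdKeyMap and its methods are made up for illustration,
and it assumes a modern JDK (java.nio.file, Java 7 or later) and a file
of plain 4-byte big-endian ints, one per Lucene document id. A single
mapping is limited to 2GB, which is still room for roughly 500 million
keys.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: synthetic database keys indexed by Lucene document id. */
final class DocIdKeyMap {
    private final IntBuffer keys;

    private DocIdKeyMap(IntBuffer keys) {
        this.keys = keys;
    }

    /** Writes keysByDocId[docId] for every doc id as a flat file of 4-byte ints. */
    static void write(Path file, int[] keysByDocId) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(keysByDocId.length * Integer.BYTES);
            // The int view fills the bytes; buf's own position stays at 0.
            buf.asIntBuffer().put(keysByDocId);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    /** Memory-maps the file read-only; the mapping stays valid after the channel closes. */
    static DocIdKeyMap open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return new DocIdKeyMap(mapped.asIntBuffer());
        }
    }

    /** Returns the database key for a Lucene document id without touching the index. */
    int keyFor(int docId) {
        return keys.get(docId);
    }

    /** Number of doc ids covered by the mapped file. */
    int size() {
        return keys.limit();
    }
}
```

The lookup is then just keyFor(docId) for each hit, so no stored fields
need to be loaded from the index. Note that the file has to be rebuilt
whenever the Lucene document ids change (e.g. after the index is
optimized or rebuilt).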
Hope this is of some help.
Richard
--
Richard Littin, Reel Two NZ, PO Box 1538, Hamilton, New Zealand
richard@reeltwo.com - ph: +64 7 857 0703 - fax: +64 7 857 0701
void main(){float x=0;while(putchar((long)((x<4)?(((6.6667*x-34.5)
*x+50.8333)*x+++82):(((-7.5*x+97)*x-398.5)*x+++619)))-10);}