Posted to dev@lucene.apache.org by Martin Sevigny <se...@ajlsm.com> on 2002/09/06 09:54:40 UTC
RE : Reading terms performance
Hi,
Dmitry Serebrennikov [mailto:dmitrys@earthlink.net] wrote
> If you use the method IndexReader.terms(Term startAt), the enumeration
> will start with the first term equal to or greater than the one supplied.
> The terms are ordered by field + text, so all terms of a given field come
> together. If you create your initial term with the field you are
> interested in and "" for the text, you will start the enumeration at the
> first term of that field. Now just go through the enum, calling next()
> until the returned term has a field other than the one you are interested
> in. Field names are interned (see String.intern()), so they can be
> compared with == instead of .equals(). This speeds things up a lot.
I see. I had seen this method, but I hadn't thought of using "" as the
text value, nor of the == trick. I'll test that. Thanks.
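For the archive, here is roughly how I understand the pattern. This is only a self-contained model, not Lucene code: ModelTerm, TermWalk and textsForField are names made up for illustration, and a sorted TreeSet stands in for the term dictionary that IndexReader.terms(Term startAt) would enumerate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a term: a (field, text) pair. Not Lucene's Term class.
final class ModelTerm {
    final String field; // interned, so two equal field names are the same object
    final String text;

    ModelTerm(String field, String text) {
        this.field = field.intern();
        this.text = text;
    }
}

final class TermWalk {
    // Terms are ordered by field first, then by text, as described in the quoted message.
    static final Comparator<ModelTerm> ORDER = (a, b) -> {
        int byField = a.field.compareTo(b.field);
        return byField != 0 ? byField : a.text.compareTo(b.text);
    };

    // Collects the text of every term in one field, in sorted order.
    static List<String> textsForField(TreeSet<ModelTerm> dictionary, String fieldName) {
        String field = fieldName.intern();
        List<String> texts = new ArrayList<>();
        // Seeking to (field, "") lands on the first term of that field, because ""
        // sorts before any other text. tailSet plays the role of terms(startAt) here.
        for (ModelTerm t : dictionary.tailSet(new ModelTerm(field, ""), true)) {
            // Both names are interned, so a reference comparison is enough.
            if (t.field != field) {
                break; // walked past the last term of the field
            }
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        TreeSet<ModelTerm> dictionary = new TreeSet<>(ORDER);
        dictionary.add(new ModelTerm("author", "sevigny"));
        dictionary.add(new ModelTerm("title", "lucene"));
        dictionary.add(new ModelTerm("title", "index"));
        dictionary.add(new ModelTerm("year", "2002"));
        System.out.println(textsForField(dictionary, "title")); // prints [index, lucene]
    }
}
```

The == comparison is only safe because both sides are interned; with arbitrary strings it would have to be .equals().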
> Yes, it is significant for searching. However, if you do not run queries
> against a given field, but just want to use it as a dictionary, the
> terms can have any form, for example "sortprefix:value", so that they
> sort correctly and yet the actual values can be extracted.
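A minimal sketch of the "sortprefix:value" idea, as I read it. The choice of a lowercased copy of the value as the sort prefix is my assumption, as are the names SortPrefixDemo, encode, valueOf and sortedValues; the sketch also assumes the prefix itself never contains ':'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

final class SortPrefixDemo {
    // Builds the stored term text: a sort key, a ':' separator, then the real value.
    // Assumption: the sort key is simply the lowercased value.
    static String encode(String value) {
        return value.toLowerCase(Locale.ROOT) + ":" + value;
    }

    // Recovers the real value: everything after the first ':'.
    static String valueOf(String termText) {
        int sep = termText.indexOf(':');
        return sep < 0 ? termText : termText.substring(sep + 1);
    }

    // Sorts the encoded terms as plain strings, then extracts the values.
    static List<String> sortedValues(List<String> values) {
        TreeSet<String> terms = new TreeSet<>();
        for (String v : values) {
            terms.add(encode(v));
        }
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            result.add(valueOf(term));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = List.of("Zebra", "apple", "Mango");
        // Plain string order would give Mango, Zebra, apple (uppercase sorts first).
        System.out.println(sortedValues(values)); // prints [apple, Mango, Zebra]
    }
}
```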
...
> If you are doing things the way described above, I don't know of any
> other ways to up the speed. You may want to store the terms in a
> different way, where they would be compressed and take up less space on
> disk, thereby causing less disk IO. Perhaps you can use Lucene to
> extract and alphabetize the terms, and then transfer them into another
> file for faster access. How large is your document base? How many unique
> terms in the target field? In my experience term access is quite fast...
Well, we have a generic XML search engine built with Lucene, so these
numbers vary from application to application. I think it is mostly the
sorting that is slow (which ties in with your previous remark), so maybe
the terms should be stored elsewhere, already correctly sorted.
Thanks a lot for your feedback,
Martin Sévigny