Posted to dev@lucene.apache.org by Martin Sevigny <se...@ajlsm.com> on 2002/09/06 09:54:40 UTC

RE : Reading terms performance

Hi,

Dmitry Serebrennikov [mailto:dmitrys@earthlink.net] wrote

> If you use the method IndexReader.terms(Term startAt), the enumeration
> will start with the term equal to or greater than the one supplied. The
> terms are ordered by field + text, so all terms of a given field come
> together. If you create your initial term with the field you are
> interested in and a "" for text, you will start enumeration with the
> first term of that field. Now, just go through the enum calling next()
> until the returned term has a field other than the one you are
> interested in. Field names are interned (see String.intern()), so they
> can be compared with == instead of .equals(). This speeds things up a
> lot.

I see. I had seen this method, but I hadn't thought of using "" as the
text value, nor of the == trick. I'll test that. Thanks.
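
For reference, here is a minimal sketch of the scan Dmitry describes. It
does not call Lucene itself: the Term class and the TreeSet dictionary
below are hypothetical stand-ins for the index's sorted term list, and
tailSet() plays the role of IndexReader.terms(Term startAt). The names
FieldTermScan and textsForField are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class FieldTermScan {
    /** Hypothetical stand-in for an index term: a (field, text) pair. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field.intern(); // interned, so == on fields is valid
            this.text = text;
        }
    }

    /** Orders terms by field, then by text, as described in the thread. */
    static final Comparator<Term> ORDER =
        Comparator.comparing((Term t) -> t.field).thenComparing(t -> t.text);

    /** Collects the text of every term belonging to the given field. */
    static List<String> textsForField(TreeSet<Term> dictionary, String field) {
        String wanted = field.intern();
        List<String> texts = new ArrayList<>();
        // Seek to the first term >= (field, ""); "" sorts before any other text.
        for (Term t : dictionary.tailSet(new Term(wanted, ""), true)) {
            if (t.field != wanted) {
                break; // first term of the next field: stop scanning
            }
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        TreeSet<Term> dictionary = new TreeSet<>(ORDER);
        dictionary.add(new Term("year", "2002"));
        dictionary.add(new Term("title", "lucene"));
        dictionary.add(new Term("author", "sevigny"));
        dictionary.add(new Term("title", "index"));
        System.out.println(textsForField(dictionary, "title")); // prints [index, lucene]
    }
}
```

The identity check (!=) is only safe because both field names went
through String.intern(); with non-interned strings, .equals() would be
required.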

> Yes, it is significant for searching. However, if you do not run
> queries against a given field, but just want to use it as a
> dictionary, the terms can have any form. For example
> "sortprefix:value", so that they sort correctly and yet actual values
> can be extracted.
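
A small sketch of the "sortprefix:value" idea, under stated assumptions:
the sort key here is the value lowercased with accents stripped, which is
only one possible choice, and SortPrefixTerms, sortKey, encode and decode
are hypothetical names. A TreeSet stands in for the sorted term
dictionary.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class SortPrefixTerms {
    // Separator taken from the "sortprefix:value" example in the thread.
    // Caveat: ':' sorts after digits and spaces, so keys containing such
    // characters would need a lower separator or fixed-width keys.
    static final char SEPARATOR = ':';

    /** Assumed sort key: accents stripped, separator removed, lowercased. */
    static String sortKey(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "")
                         .replace(String.valueOf(SEPARATOR), "")
                         .toLowerCase(Locale.ROOT);
    }

    /** Builds the term text stored in the dictionary field. */
    static String encode(String value) {
        return sortKey(value) + SEPARATOR + value;
    }

    /** Recovers the original value: everything after the first separator. */
    static String decode(String termText) {
        return termText.substring(termText.indexOf(SEPARATOR) + 1);
    }

    /** Returns the values in the order their encoded terms would sort. */
    static List<String> sortedValues(List<String> values) {
        TreeSet<String> terms = new TreeSet<>();
        for (String value : values) {
            terms.add(encode(value));
        }
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            result.add(decode(term));
        }
        return result;
    }
}
```

Because the key never contains the separator, decode() stays correct even
when the original value itself contains a ':'.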

...

> If you are doing things the way described above, I don't know of any
> other ways to up the speed. You may want to store the terms in a
> different way, where they would be compressed and take up less space
> on disk, thereby causing less disk IO. Perhaps you can use Lucene to
> extract and alphabetize the terms, and then transfer them into another
> file for faster access. How large is your document base? How many
> unique terms in the target field? In my experience term access is
> quite fast...

Well, we have a generic XML search engine built with Lucene, so these
numbers vary from application to application. I think it is mostly the
sorting that is slow (which fits with your previous remark). So maybe
the terms should be stored elsewhere, already correctly sorted.

Thanks a lot for your feedback,

Martin Sévigny

