Posted to dev@lucene.apache.org by Martin Sevigny <se...@ajlsm.com> on 2002/09/05 16:58:54 UTC

Reading terms performance

Lucene developers,

If an application using Lucene wants to read the list of values for a
field, it must use (I think) the IndexReader.terms() method. But this
method is costly, because it returns all values for all fields, even
when we only want the values of a single field.

Are there any tricks here to increase performance? Are there any plans?
For instance, all field values are stored in a single file for a segment
(.tis). Maybe splitting the values into a specific file per field would
make it work better?

The other thing I was wondering is the sorting of these terms. They are
retrieved in the order given by Java's compareTo() method. This means
that they are sometimes in alphabetical order (English or English-like
languages), but not always. Is this ordering really significant in the
internals of Lucene? Or is it just there for convenience to the
application developer?

I'm asking because we have an application that makes a lot of use of
these lists of terms, for non-English values, and performance in reading
the values and re-sorting them is a problem right now.

Thanks for any clues,

Martin Sévigny


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


RE : Reading terms performance

Posted by Martin Sevigny <se...@ajlsm.com>.
Hi,

Dmitry Serebrennikov [mailto:dmitrys@earthlink.net] wrote

> If you use the method IndexReader.terms(Term startAt) the enumeration
> will start with the term equal to or greater than the one supplied. The
> terms are ordered by field + text, so all terms of a given field come
> together. If you create your initial term with the field you are
> interested in and a "" for text, you will start enumeration with the
> first term of that field. Now, just go through the enum calling next()
> until the returned term has a field other than the one you are
> interested in. Field names are interned (see String.intern()), so they
> can be compared with == instead of .equals(). This speeds things up a lot.

I see. I had seen this method, but I hadn't thought of using "" as a
value, nor of the == trick. I'll test that. Thanks.

> Yes, it is significant for searching. However, if you do not run queries
> against a given field, but just want to use it as a dictionary, the
> terms can have any form. For example "sortprefix:value", so that they
> sort correctly and yet actual values can be extracted.

...

> If you are doing things the way described above, I don't know of any
> other ways to up the speed. You may want to store the terms in a
> different way, where they would be compressed and take up less space on
> disk, thereby causing less disk IO. Perhaps you can use Lucene to
> extract and alphabetize the terms, and then transfer them into another
> file for faster access. How large is your document base? How many unique
> terms in the target field? In my experience term access is quite fast...

Well, we have a generic XML search engine built with Lucene, so these
numbers vary from application to application. I think it's mostly the
sorting that is slow (along with your previous remark). So maybe the
terms should be stored elsewhere, correctly sorted.

Thanks a lot for your feedback,

Martin Sévigny




Re: Reading terms performance

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Martin Sevigny wrote:

>Lucene developers,
>
>If an application using Lucene wants to read the list of values for a
>field, it must use (I think) the IndexReader.terms() method. But this
>method is costly, because it returns all values for all fields, even
>when we only want the values of a single field.
>
If you use the method IndexReader.terms(Term startAt) the enumeration
will start with the term equal to or greater than the one supplied. The
terms are ordered by field + text, so all terms of a given field come
together. If you create your initial term with the field you are
interested in and a "" for text, you will start enumeration with the
first term of that field. Now, just go through the enum calling next()
until the returned term has a field other than the one you are
interested in. Field names are interned (see String.intern()), so they
can be compared with == instead of .equals(). This speeds things up a lot.
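
A minimal sketch of this seek-and-scan pattern, written against the Java
standard library rather than Lucene itself so that it runs on its own. The
ModelTerm class, the TreeSet used as a "dictionary", and the textsOfField
helper are all hypothetical stand-ins, not Lucene classes; they only mimic
the field + text ordering and the interned-field comparison described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Stand-in model of a term dictionary; these are not Lucene's classes. */
public class FieldTermScan {

    /** Hypothetical (field, text) pair ordered by field first, then text. */
    static final class ModelTerm implements Comparable<ModelTerm> {
        final String field;
        final String text;

        ModelTerm(String field, String text) {
            this.field = field.intern(); // interned so that == comparison is valid
            this.text = text;
        }

        @Override
        public int compareTo(ModelTerm other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    /** Seeks to (field, "") and scans forward until the field changes. */
    static List<String> textsOfField(TreeSet<ModelTerm> dictionary, String field) {
        String wanted = field.intern();
        List<String> texts = new ArrayList<>();
        // tailSet(x, true) yields every term >= x, like an enumeration started at x.
        for (ModelTerm t : dictionary.tailSet(new ModelTerm(wanted, ""), true)) {
            if (t.field != wanted) { // identity check is safe: both sides are interned
                break;
            }
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        TreeSet<ModelTerm> dictionary = new TreeSet<>();
        dictionary.add(new ModelTerm("title", "lucene"));
        dictionary.add(new ModelTerm("author", "sevigny"));
        dictionary.add(new ModelTerm("title", "apache"));
        dictionary.add(new ModelTerm("author", "serebrennikov"));
        dictionary.add(new ModelTerm("year", "2002"));
        System.out.println(textsOfField(dictionary, "title")); // prints [apache, lucene]
    }
}
```

The scan touches only the terms of the requested field plus one term of the
next field, which is the point of starting the enumeration at (field, "").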

TermEnums are efficient in that they skip into the term enumeration 
quickly (using an in-memory index of all terms in a given segment, which 
are stored on disk). Also, the TermEnum will read ahead as appropriate 
so that you don't read (much) more than you have to.

Finally, the .terms(Term) method differs from the .terms() method in one
tricky way that can bite you if you are not careful. The TermEnum that is
returned from the .terms() method is positioned *before* the first term, so
that next() must be called before you can use the enumeration.

However, the TermEnum returned from the terms(Term) method is positioned
*at* the starting term (which is greater than or equal to the term supplied).
That means that you should process the current term first, and call
next() afterwards.
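
The two loop shapes this implies can be sketched as follows. ModelEnum is a
hypothetical list-backed cursor invented for this illustration, not Lucene's
TermEnum; in particular, returning null from term() when there is no current
item is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy enumeration used only to show the two loop shapes; not Lucene's TermEnum. */
public class EnumPositioning {

    /** Hypothetical cursor over a list with next()/term() style access. */
    static final class ModelEnum {
        private final List<String> items;
        private int pos;

        /** startIndex -1 models "before the first term"; 0 models "at the first term". */
        ModelEnum(List<String> items, int startIndex) {
            this.items = items;
            this.pos = startIndex;
        }

        /** Advances by one; returns false once the end has been passed. */
        boolean next() {
            if (pos < items.size()) {
                pos++;
            }
            return pos < items.size();
        }

        /** Current item, or null when positioned before the start or past the end. */
        String term() {
            return (pos >= 0 && pos < items.size()) ? items.get(pos) : null;
        }
    }

    /** Loop shape for an enumeration positioned before its first term. */
    static List<String> readFromBeforeFirst(ModelEnum e) {
        List<String> out = new ArrayList<>();
        while (e.next()) {
            out.add(e.term());
        }
        return out;
    }

    /** Loop shape for an enumeration already positioned at its first term. */
    static List<String> readFromAtFirst(ModelEnum e) {
        List<String> out = new ArrayList<>();
        if (e.term() == null) {
            return out; // nothing at or after the starting point
        }
        do {
            out.add(e.term()); // process the current term first...
        } while (e.next());    // ...and only then advance
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c");
        System.out.println(readFromBeforeFirst(new ModelEnum(terms, -1))); // [a, b, c]
        System.out.println(readFromAtFirst(new ModelEnum(terms, 0)));      // [a, b, c]
        // The pitfall: the "before first" loop on an "at first" cursor skips a term.
        System.out.println(readFromBeforeFirst(new ModelEnum(terms, 0)));  // [b, c]
    }
}
```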

>
>Are there any tricks here to increase performance? Are there any plans?
>For instance, all field values are stored in a single file for a segment
>(.tis). Maybe splitting the values into a specific file per field would
>make it work better?
>
>The other thing I was wondering is the sorting of these terms. They are
>retrieved in the order given by Java's compareTo() method. This means
>that they are sometimes in alphabetical order (English or English-like
>languages), but not always. Is this ordering really significant in the
>internals of Lucene? Or is it just there for convenience to the
>application developer?
>
Yes, it is significant for searching. However, if you do not run queries 
against a given field, but just want to use it as a dictionary, the 
terms can have any form. For example "sortprefix:value", so that they 
sort correctly and yet actual values can be extracted.
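
One possible way to build such "sortprefix:value" terms is sketched below,
under stated assumptions: the prefix is a zero-padded rank obtained by sorting
all values once with a java.text.Collator, and the value is whatever follows
the first ':'. The class and method names are hypothetical, and a rank prefix
only works when the full set of values is known before the terms are written.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Sketch of "sortprefix:value" terms; the rank-based prefix is an assumption. */
public class SortPrefixTerms {

    /** Builds "rank:value" strings whose plain compareTo order matches the collator's order. */
    static List<String> encode(List<String> values, Collator collator) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(collator);
        List<String> encoded = new ArrayList<>();
        for (int rank = 0; rank < sorted.size(); rank++) {
            // Zero-padding makes numeric rank order and string order agree.
            encoded.add(String.format(Locale.ROOT, "%08d:%s", rank, sorted.get(rank)));
        }
        return encoded;
    }

    /** Recovers the actual value: everything after the first ':'. */
    static String decode(String term) {
        return term.substring(term.indexOf(':') + 1);
    }

    public static void main(String[] args) {
        // "\u00e9cole" is "école"; plain compareTo would place it after "zebra".
        List<String> values = Arrays.asList("zebra", "\u00e9cole", "eclair");
        Collator french = Collator.getInstance(Locale.FRENCH);

        // A TreeSet orders by String.compareTo, like a term dictionary would.
        TreeSet<String> dictionary = new TreeSet<>(encode(values, french));
        for (String term : dictionary) {
            System.out.println(term + " -> " + decode(term));
        }
    }
}
```

Because the prefix alone decides the order, the application can read the terms
back in collated order without re-sorting them.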

>
>I'm asking because we have an application that makes a lot of use of
>these lists of terms, for non-English values, and performance in reading
>the values and re-sorting them is a problem right now.
>
If you are doing things the way described above, I don't know of any 
other ways to up the speed. You may want to store the terms in a 
different way, where they would be compressed and take up less space on 
disk, thereby causing less disk IO. Perhaps you can use Lucene to 
extract and alphabetize the terms, and then transfer them into another 
file for faster access. How large is your document base? How many unique
terms in the target field? In my experience term access is quite fast...
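
As a rough illustration of the "transfer them into another file" idea, the
sketch below sorts a list of terms once with a Collator and writes them, one
per line, to a UTF-8 text file that can later be read back already in order.
The file layout, the class name, and the helper methods are assumptions made
for this example (it also assumes terms contain no line breaks); the terms
themselves would come from the Lucene enumeration described earlier.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch of a pre-sorted side file for terms; the one-term-per-line layout is an assumption. */
public class SortedTermFile {

    /** Sorts a copy of the terms with the collator and writes one term per line. */
    static void writeSorted(List<String> terms, Collator collator, Path file) throws IOException {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator);
        Files.write(file, sorted, StandardCharsets.UTF_8);
    }

    /** Reads the terms back; they are already in collated order, so no sort is needed. */
    static List<String> read(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("terms", ".txt");
        try {
            List<String> terms = Arrays.asList("zebra", "\u00e9cole", "eclair");
            writeSorted(terms, Collator.getInstance(Locale.FRENCH), file);
            System.out.println(read(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The sort cost is paid once when the file is written, which matters if the same
list is read many times between index updates.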

Also, check the IO speed to the disk you are storing this on. In my 
experience, a slow disk or a slower bus to that disk can slow things 
down by as much as 10 to 20 times!

Good luck.
Dmitry.

>
>Thanks for any clues,
>
>Martin Sévigny
>