Posted to dev@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2003/12/04 19:28:58 UTC

Re: suggestion for a CustomDirectory

Julien Nioche wrote:
> However in most cases the
> application would be faster because:
> - tree access to the Term (this is only the case for the Terms in the .tii)
> - no need to create up to 127 temporary Term objects (with creation of
> Strings and so on...)
> - less garbage collection

The .tii is already read into memory when the index is opened.  So the 
only savings would be the creation of (on average) 64 temporary Term 
objects per query.  Do you have any evidence that this is a substantial 
part of the computation?  I'd be surprised if it was.  To find out, you 
could write a program which compares the time it takes to call docFreq() 
on a set of terms (allocating the 64 temporary Terms) to what it takes 
to perform queries (doing the rest of the work).  I'll bet that the 
first is substantially faster: most of the work of executing a query is 
processing the .frq and .prx files.  These are bigger than the RAM on 
your machine, and so cannot be cached.  Thus you'll always be doing some 
disk i/o, which will likely dominate real performance.
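
For example, something along these lines would give a rough comparison.
This is only a sketch, not tested: the index path, the "contents" field
and the sample terms are all made up, and it assumes the current (1.3-era)
API.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class DocFreqVsSearch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(reader);
    String[] words = { "foo", "bar", "baz" };      // sample terms

    // 1. term lookup only: allocates the temporary Terms, scans the .tis
    long start = System.currentTimeMillis();
    for (int i = 0; i < words.length; i++)
      reader.docFreq(new Term("contents", words[i]));
    long lookupTime = System.currentTimeMillis() - start;

    // 2. full queries: lookup plus processing of the .frq/.prx files
    start = System.currentTimeMillis();
    for (int i = 0; i < words.length; i++)
      searcher.search(new TermQuery(new Term("contents", words[i])));
    long searchTime = System.currentTimeMillis() - start;

    System.out.println("docFreq: " + lookupTime + " ms");
    System.out.println("search:  " + searchTime + " ms");
    reader.close();
  }
}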

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: suggestion for a CustomDirectory

Posted by Doug Cutting <cu...@lucene.com>.
Julien Nioche wrote:
> Profiling my application indicates that a lot of time is spent on the
> creation of temporary Term objects.

It does indeed look like term lookup is using a lot of your time.  I 
don't see the Term constructor showing up as significant in your 
profile, so it looks to me like it could just be the cost of parsing the 
data, not the allocation/GC stuff.  I've found that allocation of 
temporary objects doesn't really cost much with modern garbage 
collectors.  The biggest cost of allocating objects is sometimes just 
the constructor.

What sort of queries are you making against what sort of an index?  It 
looks like you're probably making large queries with lots of 
low-frequency terms, in order for term lookup to be such a large factor.
You might try sorting the terms in the query.  If subsequent lookups 
are nearby in the TermInfo file then it won't have to scan as much.
Could that help?  Also, is your index optimized?  An optimized index 
will drastically reduce the term lookup costs.
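
Just to illustrate the sorting idea, here is a sketch of doing a batch of
lookups in sorted order (the field name and the docFreq() batching are
made up for the example; the same ordering applies to the lookups a query
does internally):

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class SortedLookups {
  // Look up docFreq for a batch of terms of one field in sorted order,
  // so successive lookups land near each other in the .tis file.
  public static int[] docFreqs(IndexReader reader, String field, String[] words)
      throws IOException {
    String[] sorted = (String[]) words.clone();
    Arrays.sort(sorted);      // within a single field, text order == Term order
    int[] freqs = new int[sorted.length];
    for (int i = 0; i < sorted.length; i++)
      freqs[i] = reader.docFreq(new Term(field, sorted[i]));
    return freqs;
  }
}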

If all these fail, try reducing TermInfosWriter.INDEX_INTERVAL.  You'll 
have to re-create your indexes each time you change this constant.  You 
might try a value like 16.  This would keep the number of terms in 
memory from being too huge (1 of 16 terms), but would reduce the average 
number scanned from 64 to 8, which would be substantial.  Tell me how 
this works.  If it makes a big difference, then perhaps we should make 
this parameter more easily changeable.
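
For what it's worth, re-creating the index after changing the constant
(e.g. from the current 128 down to 16) and recompiling Lucene is just the
usual indexing loop against a fresh directory.  A minimal sketch, with
made-up paths and field names:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Reindex {
  public static void main(String[] args) throws Exception {
    // create=true starts a fresh index, written with the new INDEX_INTERVAL
    IndexWriter writer =
        new IndexWriter("/path/to/new-index", new StandardAnalyzer(), true);

    // ... add your documents back here, e.g.:
    Document doc = new Document();
    doc.add(Field.Text("contents", "some text"));
    writer.addDocument(doc);

    writer.optimize();   // a single segment also keeps term lookup cheap
    writer.close();
  }
}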

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: suggestion for a CustomDirectory

Posted by Julien Nioche <Ju...@lingway.com>.
Thank you for your answer, Doug.

Profiling my application indicates that a lot of time is spent on the
creation of temporary Term objects.

This is at least true for PhraseQuery weighting, as shown in the profiling
figures below:

.41.2% - 473240 ms - 2802 inv.
org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer
..40.4% - 464202 ms - 7440 inv.
org.apache.lucene.index.IndexReader.termPositions
...40.1% - 460378 ms - 7440 inv.
org.apache.lucene.index.SegmentTermDocs.seek
....40.0% - 459297 ms - 7440 inv.
org.apache.lucene.index.TermInfosReader.get
.....39.1% - 448370 ms - 7440 inv.
org.apache.lucene.index.TermInfosReader.scanEnum
.......34.4% - 394578 ms - 484790 inv.
org.apache.lucene.index.SegmentTermEnum.next
.........25.8% - 296435 ms - 484790 inv.
org.apache.lucene.index.SegmentTermEnum.readTerm
.........3.5% - 40565 ms - 969580 inv.
org.apache.lucene.store.InputStream.readVLong
.........1.8% - 21147 ms - 484790 inv.
org.apache.lucene.store.InputStream.readVInt

This is method time only; it doesn't take into account the time required to
garbage-collect all those temporary objects.

I'll test other applications I made to confirm this.

Scott,

I tried NIODirectory and provided some benchmarks for it on the list with my
apps. It improves the overall performance a little, but it would be
interesting if we could choose which files to map into memory.

----- Original Message -----
From: "Doug Cutting" <cu...@lucene.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Thursday, December 04, 2003 7:28 PM
Subject: Re: suggestion for a CustomDirectory


> Julien Nioche wrote:
> > However in most cases the
> > application would be faster because:
> > - tree access to the Term (this is only the case for the Terms in the .tii)
> > - no need to create up to 127 temporary Term objects (with creation of
> > Strings and so on...)
> > - less garbage collection
>
> The .tii is already read into memory when the index is opened.  So the
> only savings would be the creation of (on average) 64 temporary Term
> objects per query.  Do you have any evidence that this is a substantial
> part of the computation?  I'd be surprised if it was.  To find out, you
> could write a program which compares the time it takes to call docFreq()
> on a set of terms (allocating the 64 temporary Terms) to what it takes
> to perform queries (doing the rest of the work).  I'll bet that the
> first is substantially faster: most of the work of executing a query is
> processing the .frq and .prx files.  These are bigger than the RAM on
> your machine, and so cannot be cached.  Thus you'll always be doing some
> disk i/o, which will likely dominate real performance.
>
> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org