Posted to java-user@lucene.apache.org by John Patterson <de...@hotmail.com> on 2004/07/26 21:41:21 UTC

Caching of TermDocs

Is there any way to cache TermDocs?  Is this a good idea?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Caching of TermDocs

Posted by John Patterson <de...@hotmail.com>.

Cool.  I'll give it a try.  Looks like extending FilterIndexReader is the
way to go.  Or possibly I could cache the compressed form at a lower level
getting the best of both worlds.  I'll look into both ways, profile the app,
and post my results.

----- Original Message ----- 
From: "Doug Cutting" <cu...@apache.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, July 27, 2004 8:33 PM
Subject: Re: Caching of TermDocs


> John Patterson wrote:
> > I would like to hold a significant amount of the index in memory but use the
> > disk index as a spill over.  Obviously the best situation is to hold in
> > memory only the information that is likely to be used again soon.  It seems
> > that caching TermDocs would allow popular search terms to be searched more
> > efficiently while the less common terms would need to be read from disk.
>
> The operating system already caches recent disk i/o.  So what you'd save
> primarily would be the overhead of parsing the data.  However the parsed
> form, a sequence of docNo and freq ints, is nearly eight times as large
> as its compressed size in the index.  So your cache would consume a lot
> of memory.
>
> Whether this provides much overall speedup depends on the distribution
> of common terms in your query traffic.  If you have a few terms that are
> searched very frequently then it might pay off.  In my experience with
> general-purpose search engines this is not usually the case: folks seem
> to use rarer words in queries than they do in ordinary text.  But in
> some search applications perhaps the traffic is more skewed.  Only some
> experiments would tell for sure.
>
> Doug

Re: Caching of TermDocs

Posted by Doug Cutting <cu...@apache.org>.

John Patterson wrote:
> I would like to hold a significant amount of the index in memory but use the
> disk index as a spill over.  Obviously the best situation is to hold in
> memory only the information that is likely to be used again soon.  It seems
> that caching TermDocs would allow popular search terms to be searched more
> efficiently while the less common terms would need to be read from disk.

The operating system already caches recent disk i/o.  So what you'd save 
primarily would be the overhead of parsing the data.  However the parsed 
form, a sequence of docNo and freq ints, is nearly eight times as large 
as its compressed size in the index.  So your cache would consume a lot 
of memory.
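
A rough way to see where the "nearly eight times" figure can come from is to
count bytes for both forms.  The sketch below is only an illustration, not
Lucene code.  It assumes the postings are stored as variable-length ints, with
the doc number written as a gap from the previous doc, shifted left one bit,
and the low bit set when the frequency is 1 (so no separate freq value is
written in that case).  The class and method names are made up for the example.

```java
// Sketch only: estimates the size of one term's postings in a compressed,
// gap-encoded form versus the parsed form of two 4-byte ints per posting.
class PostingsSizeSketch {

    // Bytes needed for a non-negative int when each byte carries 7 payload bits.
    static int vIntSize(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Assumed encoding: (gap << 1) | 1 when freq == 1,
    // otherwise (gap << 1) followed by freq as a second variable-length int.
    static long compressedSize(int[] docs, int[] freqs) {
        long total = 0;
        int previous = 0;
        for (int i = 0; i < docs.length; i++) {
            int gap = docs[i] - previous;
            previous = docs[i];
            if (freqs[i] == 1) {
                total += vIntSize((gap << 1) | 1);
            } else {
                total += vIntSize(gap << 1) + vIntSize(freqs[i]);
            }
        }
        return total;
    }

    // Parsed form: one int for the doc number and one for the frequency.
    static long parsedSize(int[] docs) {
        return 8L * docs.length;
    }

    public static void main(String[] args) {
        // A made-up common term: it occurs once in every third document.
        int n = 100000;
        int[] docs = new int[n];
        int[] freqs = new int[n];
        for (int i = 0; i < n; i++) {
            docs[i] = 3 * i;
            freqs[i] = 1;
        }
        long compressed = compressedSize(docs, freqs);
        long parsed = parsedSize(docs);
        System.out.println("compressed bytes: " + compressed);
        System.out.println("parsed bytes:     " + parsed);
        System.out.println("ratio:            " + (double) parsed / compressed);
    }
}
```

The ratio shrinks for rarer terms (larger gaps need more bytes) and for
postings with frequencies above 1, so the factor is an upper-end estimate for
very common terms.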

Whether this provides much overall speedup depends on the distribution 
of common terms in your query traffic.  If you have a few terms that are 
searched very frequently then it might pay off.  In my experience with 
general-purpose search engines this is not usually the case: folks seem 
to use rarer words in queries than they do in ordinary text.  But in 
some search applications perhaps the traffic is more skewed.  Only some 
experiments would tell for sure.
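
One cheap experiment is to take a log of query terms and check how much of
the traffic the most frequent terms account for.  The sketch below assumes the
log has already been split into a list of term strings; the class and method
names are hypothetical and nothing here touches Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: measures how skewed a query-term log is.
class QuerySkewSketch {

    // Fraction of all term lookups that fall on the k most frequent terms.
    // A value near 1.0 for small k suggests a cache of those terms could pay off.
    static double topKCoverage(List<String> queryTerms, int k) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String term : queryTerms) {
            counts.merge(term, 1, Integer::sum);
        }
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort(Collections.reverseOrder());
        long covered = 0;
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            covered += sorted.get(i);
        }
        return (double) covered / queryTerms.size();
    }

    public static void main(String[] args) {
        // Made-up log: one popular term and a few rare ones.
        List<String> log = List.of("lucene", "lucene", "lucene", "cache", "termdocs", "index");
        System.out.println("top-1 coverage: " + topKCoverage(log, 1));
        System.out.println("top-2 coverage: " + topKCoverage(log, 2));
    }
}
```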

Doug

Re: Caching of TermDocs

Posted by John Patterson <de...@hotmail.com>.

The caching by TermScorer of the next 32 Docs is a way to speed up the
serial (in order) reading of docs from the TermDocs object (probably coming
direct from disk).

I would like to hold a significant amount of the index in memory but use the
disk index as a spill over.  Obviously the best situation is to hold in
memory only the information that is likely to be used again soon.  It seems
that caching TermDocs would allow popular search terms to be searched more
efficiently while the less common terms would need to be read from disk.
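
To make the idea concrete, here is a minimal sketch of such a cache, assuming
a least-recently-used policy and a budget counted in postings rather than in
terms.  It is not tied to the Lucene API: the Postings holder, the diskLoader
function and all other names are hypothetical stand-ins for whatever actually
reads a term's docs and freqs from the index.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: keeps recently used terms' parsed postings in memory and
// falls back to a loader (standing in for the disk index) on a miss.
class PostingsCacheSketch {

    // Hypothetical holder for one term's parsed postings.
    static final class Postings {
        final int[] docs;
        final int[] freqs;

        Postings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        int size() {
            return docs.length;
        }
    }

    // accessOrder = true makes iteration run from least to most recently used.
    private final LinkedHashMap<String, Postings> cache = new LinkedHashMap<>(16, 0.75f, true);
    private final Function<String, Postings> diskLoader;
    private final long maxPostings;
    private long cachedPostings = 0;
    int diskReads = 0; // counts misses, for profiling

    PostingsCacheSketch(long maxPostings, Function<String, Postings> diskLoader) {
        this.maxPostings = maxPostings;
        this.diskLoader = diskLoader;
    }

    Postings get(String term) {
        Postings hit = cache.get(term); // a hit also marks the term as recently used
        if (hit != null) {
            return hit;
        }
        Postings loaded = diskLoader.apply(term);
        diskReads++;
        // Lists larger than the whole budget are served but never cached.
        if (loaded.size() <= maxPostings) {
            cache.put(term, loaded);
            cachedPostings += loaded.size();
            evictUntilWithinBudget();
        }
        return loaded;
    }

    // Drops least recently used terms until the budget is respected.
    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<String, Postings>> it = cache.entrySet().iterator();
        while (cachedPostings > maxPostings && it.hasNext()) {
            cachedPostings -= it.next().getValue().size();
            it.remove();
        }
    }

    public static void main(String[] args) {
        // Fake loader: every term has three postings.
        PostingsCacheSketch cache = new PostingsCacheSketch(6,
                term -> new Postings(new int[] {1, 2, 3}, new int[] {1, 1, 1}));
        for (String term : new String[] {"a", "b", "a", "c", "a", "b"}) {
            cache.get(term);
        }
        System.out.println("disk reads: " + cache.diskReads);
    }
}
```

The sketch is not thread-safe; concurrent searches would need
synchronization around get().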

Has anyone else done this?  Know of a better approach?

----- Original Message ----- 
From: "Paul Elschot" <pa...@xs4all.nl>
To: <lu...@jakarta.apache.org>
Sent: Tuesday, July 27, 2004 3:07 AM
Subject: Re: Caching of TermDocs

> On Monday 26 July 2004 21:41, John Patterson wrote:
>
> > Is there any way to cache TermDocs?  Is this a good idea?
>
> Lucene does this internally by buffering
> up to 32 document numbers in advance for a query Term.
> You can view the details here in case you're interested:
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
> It uses the TermDocs.read() method to fill a buffer of document numbers.
>
> Is this what you had in mind?
>
> Regards,
> Paul

Re: Caching of TermDocs

Posted by Paul Elschot <pa...@xs4all.nl>.

On Monday 26 July 2004 21:41, John Patterson wrote:

> Is there any way to cache TermDocs?  Is this a good idea?

Lucene does this internally by buffering
up to 32 document numbers in advance for a query Term.
You can view the details here in case you're interested:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
It uses the TermDocs.read() method to fill a buffer of document numbers.
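
The buffering pattern looks roughly like the sketch below.  This is not the
actual TermScorer source: BulkSource is a hypothetical stand-in with a bulk
read shaped like TermDocs.read(int[] docs, int[] freqs), which fills the two
arrays and returns the number of entries read (zero once exhausted), and
ArraySource is a fake in-memory source used only to make the example run.

```java
// Sketch only: iterates postings one at a time while fetching them
// from the underlying source in batches of up to 32.
class BufferedPostingsSketch {

    // Hypothetical stand-in for a postings source with a bulk read.
    interface BulkSource {
        int read(int[] docs, int[] freqs); // entries filled; 0 when exhausted
    }

    // Fake source backed by arrays, counting how often read() is called.
    static final class ArraySource implements BulkSource {
        private final int[] allDocs;
        private final int[] allFreqs;
        private int pos = 0;
        int calls = 0;

        ArraySource(int[] allDocs, int[] allFreqs) {
            this.allDocs = allDocs;
            this.allFreqs = allFreqs;
        }

        public int read(int[] docs, int[] freqs) {
            calls++;
            int n = Math.min(docs.length, allDocs.length - pos);
            System.arraycopy(allDocs, pos, docs, 0, n);
            System.arraycopy(allFreqs, pos, freqs, 0, n);
            pos += n;
            return n;
        }
    }

    private final BulkSource source;
    private final int[] docs = new int[32];  // buffered doc numbers
    private final int[] freqs = new int[32]; // buffered frequencies
    private int pointer = 0;
    private int filled = 0;

    BufferedPostingsSketch(BulkSource source) {
        this.source = source;
    }

    // Advances to the next posting, refilling the buffer when it runs out.
    boolean next() {
        pointer++;
        if (pointer >= filled) {
            filled = source.read(docs, freqs);
            if (filled == 0) {
                return false;
            }
            pointer = 0;
        }
        return true;
    }

    int doc() {
        return docs[pointer];
    }

    int freq() {
        return freqs[pointer];
    }

    public static void main(String[] args) {
        int n = 70;
        int[] d = new int[n];
        int[] f = new int[n];
        for (int i = 0; i < n; i++) {
            d[i] = 2 * i;
            f[i] = 1;
        }
        ArraySource source = new ArraySource(d, f);
        BufferedPostingsSketch postings = new BufferedPostingsSketch(source);
        int count = 0;
        while (postings.next()) {
            count++;
        }
        System.out.println("postings seen: " + count + ", bulk reads: " + source.calls);
    }
}
```

Note that the buffer only smooths sequential reading within one search; it is
discarded afterwards, so it does not help a later query on the same term.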

Is this what you had in mind?

Regards,
Paul