Posted to java-user@lucene.apache.org by Kan Deng <ka...@yahoo.com> on 2006/01/12 02:37:25 UTC

Cache index in RAMDirectory and evict

Hi, there, 

Section 3.2.3 of "Lucene in Action" ("reading indexes
into memory") says:

"...RAMDirectory's constructor can be used to read a
file system-based index into memory, allowing the
application that accesses it to benefit from the
superior speed of the RAM:

   RAMDirectory ramDir = new RAMDirectory(dir)"

I have a few questions:

1. Suppose the index in the FSDirectory is read-only,
but too big to fit in the JVM heap. If I construct a
RAMDirectory to cache the entire FSDirectory, will it
blow up the JVM?

2. How can I cache only the most frequently used parts
of the FSDirectory index in a RAMDirectory?

   The goal is to serve each search query from the
RAMDirectory first and, on a miss, go to the
FSDirectory.

   My question is how to implement this
RAMDirectory-based cache. I assume it takes 3 steps.
Is this an appropriate workflow?

   a) Search in the RAMDirectory.
   b) On a miss, search in the FSDirectory.
   c) Add the matching documents from the FSDirectory
      to the RAMDirectory, and remove some less
      frequently used documents from the RAMDirectory
      to limit memory consumption.


3. To make the cache mechanism more powerful, we can
count how often every document in the RAMDirectory is
used, and evict the less frequently used documents.

   How should the eviction be implemented? In detail,
is it good enough to count the usage of each document
in the index and delete the documents that are not
used very often? Is there a better idea?
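
The three steps and the usage-count eviction can be pictured without
Lucene as a small in-heap map in front of a slower lookup. What follows
is only a minimal sketch under assumptions: TwoTierCache, slowLookup and
the capacity are hypothetical names, the cached values are arbitrary
objects rather than index data, and no RAMDirectory or other Lucene API
is used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical two-tier lookup: a small in-heap map in front of a slow source. */
class TwoTierCache<K, V> {
    private final int capacity;
    private final Function<K, V> slowLookup;               // stands in for the FSDirectory search
    private final Map<K, V> hot = new HashMap<>();         // stands in for the RAMDirectory
    private final Map<K, Integer> hits = new HashMap<>();  // usage count per cached key

    TwoTierCache(int capacity, Function<K, V> slowLookup) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
        this.slowLookup = slowLookup;
    }

    V get(K key) {
        V value = hot.get(key);                  // a) look in memory first
        if (value == null) {
            value = slowLookup.apply(key);       // b) on a miss, go to the slow tier
            if (value == null) return null;      // nothing found, nothing to cache
            if (hot.size() >= capacity) evictLeastUsed();
            hot.put(key, value);                 // c) keep the result in memory
        }
        hits.merge(key, 1, Integer::sum);        // count this use
        return value;
    }

    /** Removes the cached entry with the lowest usage count (linear scan). */
    private void evictLeastUsed() {
        K victim = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<K, Integer> e : hits.entrySet()) {
            if (e.getValue() < fewest) {
                fewest = e.getValue();
                victim = e.getKey();
            }
        }
        hot.remove(victim);
        hits.remove(victim);
    }

    boolean isCached(K key) {
        return hot.containsKey(key);
    }
}
```

A plain usage count has a known weakness: an entry that was popular long
ago keeps its high count and is never evicted, while a newly added entry
starts at 1 and is the first to go. Real caches usually age the counts
or fall back to recency.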


thanks,
Kan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Cache index in RAMDirectory and evict

Posted by Kan Deng <ka...@yahoo.com>.
John, thanks a lot for your excellent reply. 

I find this part especially convincing:

> "Well, you _can_ be a lot better since you know what
> you're doing. You can also be a _lot_ worse when you
> get it wrong."

With such a high risk, I should probably try other
ways to improve system performance before rushing
into implementing a cache.

thanks again,
Kan




Re: Cache index in RAMDirectory and evict

Posted by John Haxby <jc...@scalix.com>.
Kan Deng wrote:

>1. Performance. 
>
>   Since all the cached disk data resides outside JVM
>heap space, the access efficiency from Java object to
>those cached data cannot be too high.
>  
>
True, but you need to compare the relative speeds.  If data has to be 
pulled from a file, then you're talking several milliseconds to fetch 
from the disk.  If it's in the OS's cache (and here I'm rather assuming 
Linux since that's what I know about) you're talking about microseconds 
rather than milliseconds to fetch the data from the OS.  Once the data 
is in the JVM, but not in the CPU cache, then you're down to nanoseconds 
to get the data from main memory.  How many depends on the hardware: some 
platforms take a while to get the data moving but are very quick once 
it comes; others are fast to get going but don't have the throughput.  
It's not the absolute times that are important though: 
once you've got the data in the OS's cache then things like network 
latency, display update speed and scheduling overheads begin to make 
themselves felt and you won't make these any less by getting data into 
the JVM's memory.  Well, not much anyway.
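
To put rough numbers on that comparison, here is a back-of-the-envelope
sketch. Every constant in it is an assumed, illustrative order of
magnitude (5 ms per disk fetch, 5 us per OS-cache fetch, 100 ns per
in-heap fetch, 1 ms of other per-request overhead), not a measurement,
and LatencyEstimate is a hypothetical name.

```java
/** Back-of-the-envelope comparison; every number is an assumed, illustrative value. */
class LatencyEstimate {
    // Assumed orders of magnitude, in nanoseconds.
    static final long DISK_NS = 5_000_000L;            // "several milliseconds"
    static final long OS_CACHE_NS = 5_000L;            // "microseconds"
    static final long HEAP_NS = 100L;                  // "nanoseconds"
    static final long OTHER_OVERHEAD_NS = 1_000_000L;  // assumed network/scheduling cost per request

    /** Total time for one request that performs the given number of data fetches. */
    static long requestNanos(long fetchNs, int fetches) {
        return OTHER_OVERHEAD_NS + fetchNs * fetches;
    }

    public static void main(String[] args) {
        int fetches = 10;  // assumed number of fetches per request
        System.out.println("disk:     " + requestNanos(DISK_NS, fetches) + " ns");
        System.out.println("OS cache: " + requestNanos(OS_CACHE_NS, fetches) + " ns");
        System.out.println("heap:     " + requestNanos(HEAP_NS, fetches) + " ns");
    }
}
```

With these assumed numbers, ten fetches cost about 51 ms from disk,
1.05 ms from the OS cache and 1.001 ms from the heap. Getting off the
disk is a large win; moving from the OS cache into the heap saves less
than 5% of the request, because the fixed overhead dominates.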

>2. Volatile.
>
>   Since the OS caches the disk data in a common area
>shared by multiple processes, but not only JVM. If
>there are other processes doing disk IO at the same
>time, chances are the cached Lucene index data from
>disk may be wiped. 
>  
>
What you can do by hanging on to a lot of memory is make the overall 
machine performance worse.  In fact by denying other processes memory, 
you're going to force up the I/O rate and when you do need to go to the 
disk then it'll take much longer -- net result, things run slower.    
Generally speaking, because the OS has a more holistic view of resource 
management, you'll get better overall performance.

>Therefore, a more reliable and efficient cache should
>reside inside JVM heap space. But due to the crowded
>JVM heap space, we have to manually "evict" the less
>frequently used data from the cache. 
>  
>
It's that last sentence that is the critical one.   Yes, you can do your 
own cache management, but how much better are you going to be than the 
OS?    Well, you _can_ be a lot better since you know what you're 
doing.   You can also be a _lot_ worse when you get it wrong.   Choosing 
the right point to flush data from the cache ("evict") is not all that 
straightforward: the OS buffer cache was introduced into BSD unix in the 
early '80s and we're still seeing work going on to improve the basic 
strategy 20-odd years later.

If you find that you're spending an inordinate amount of time waiting 
for I/O for the index from the OS, then that is the time to start 
looking at caching strategies.  My own feeling is that you're going to 
find easier things to fix before you get that far.

>Did I mis-understand anything?
>  
>
Probably not, it's just that performance needs a more holistic approach, 
and an obvious, isolated change isn't going to have the effect that you 
want.

jch




Re: Cache index in RAMDirectory and evict

Posted by Kan Deng <ka...@yahoo.com>.
Thanks, Otis. 

Also appreciate your wonderful book, "Lucene in
Action". The book is so well written that it makes me
very curious about the low level design of the system,
in addition to how to use it. 

Back to the cache problem: I agree that the native OS
file system can do most of the job for me. However,
there are two issues if we rely only on the OS's cache.

1. Performance. 

   Since all the cached disk data resides outside the
JVM heap space, access from Java objects to the cached
data cannot be very efficient.

2. Volatile.

   The OS caches disk data in a common area shared by
multiple processes, not only the JVM. If other
processes are doing disk I/O at the same time, chances
are the cached Lucene index data will be wiped.

Therefore, a more reliable and efficient cache should
reside inside JVM heap space. But due to the crowded
JVM heap space, we have to manually "evict" the less
frequently used data from the cache. 
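
For reference, manual eviction inside the heap is most often done by
recency rather than by usage count. The JDK's LinkedHashMap supports
this through its access-order mode and removeEldestEntry hook. The
sketch below is an assumption-laden illustration: LruCache and
maxEntries are hypothetical names, and it caches arbitrary key/value
pairs, not Lucene index data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded in-heap cache that evicts the least recently used entry. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put() after inserting; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}
```

The map is not thread-safe, so a multi-threaded searcher would need to
wrap it (for example with Collections.synchronizedMap) or guard it with
a lock.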

Did I mis-understand anything?

thanks again, Otis.
Kan






Re: Cache index in RAMDirectory and evict

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Kan,

Some (all?) of what you described will typically be handled for you by the file system.  Yes, the JVM would blow up with an OOM error if the index is too big to fit in RAM.
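
One way to avoid finding this out the hard way is to compare the
on-disk size of the index with the JVM's maximum heap before loading
anything. This is a rough sketch under assumptions: HeapFitCheck and
its methods are hypothetical names, the index is assumed to be a flat
directory of files, the on-disk size is only an approximation of the
in-memory footprint, and the 0.5 safety fraction is an arbitrary
choice that leaves room for everything else on the heap.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class HeapFitCheck {
    /** Sums the sizes of the regular files directly inside indexDir. */
    static long indexSizeBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    /** True if the index is smaller than the given fraction of the maximum heap. */
    static boolean likelyFits(long indexBytes, long maxHeapBytes, double fraction) {
        return indexBytes < (long) (maxHeapBytes * fraction);
    }

    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        long size = indexSizeBytes(indexDir);
        long maxHeap = Runtime.getRuntime().maxMemory();  // upper bound set by -Xmx
        System.out.println("index bytes: " + size + ", max heap bytes: " + maxHeap);
        System.out.println(likelyFits(size, maxHeap, 0.5)
                ? "index may fit in the heap"
                : "index is too big, leave it on disk");
    }
}
```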

Otis
