You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Darren Govoni <da...@ontrenet.com> on 2008/08/13 02:56:09 UTC

Re: possible to read index into memory?

Hello,
  The kind sir below recommended the RAMDirectory for loading an on-disk
index into memory (the entire data) and using IndexSearcher off that. It
seemed to worked very well. 

On one index, I am seeing no speed change when flipping between
RAMDirectory IndexSearcher and file system version.

Creating the RAMDirectory from the on-disk index only takes 0.09
seconds. It appears it is not loading the data into memory, but maybe
just the file names of the index?

How can I load an on-disk index - the data - into memory and run
searches there?

thanks for any help. you guys are awesome!
D

On Thu, 2008-06-26 at 15:47 -0400, Erick Erickson wrote:
> >From the docs...
> 
> RAMDirectory
> 
> public *RAMDirectory*(Directory
> <file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/store/Directory.html>
> dir)
>              throws IOException
> <http://java.sun.com/j2se/1.4/docs/api/java/io/IOException.html>
> 
> Creates a new RAMDirectory instance from a different
> Directoryimplementation. This can be used to load a disk-based index
> into memory.
> 
> Seems like exactly what you're asking for...
> 
> Best
> Erick
> 
> On Thu, Jun 26, 2008 at 3:40 PM, Darren Govoni <da...@ontrenet.com> wrote:
> 
> > Hi,
> >  Is there a lucene index reader that will load a disk-based index into
> > memory and perform searches on it from RAM? Sorry if I missed this in
> > the docs somewhere.
> >
> > Darren
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: possible to read index into memory?

Posted by Darren Govoni <da...@ontrenet.com>.

Erick,
    Thank you for the valuable tips. The time I'm measuring is
just around the lucene search calls with standard analyzer, such as:

    word = "helloo"
    starttime = ...
    query = QueryParser("word", analyzer).parse(word+"~0.76")
    hits = searcher.search(query)
    endtime = ...
    endtime-starttime = .33 seconds

Its a fuzzy match, which I presume should take longer, but for a single
word in the query against a tiny 17MB index, the above code takes

.33 seconds for the first couple dozen or so, then about .15-.20 after
that. Still way way too long for a simple query as this. Do those
figures sound right for Lucene doing this kind of single field match?

Darren

On Wed, 2008-08-13 at 10:24 -0400, Erick Erickson wrote:
> How are you measuring? There is a bunch of setup work for the first
> few queries that go through the system. In either case (RAM or FS),
> you should fire a few representative warmup queries at the search
> engine before you go ahead and measure the response time.
> 
> You also *must* isolate your search time from your response
> assembly time. That is, if you have something like
> Hits hits = search()
> for (each element of hits) {
>    do something with the hit
> }
> 
> you MUST measure the time for the search() call exclusive of the
> for loop before you know where to concentrate your efforts.
> 
> In this example, if you get more than 100 hits, your query is
> actually re-executed every 100 times through the above loop.
> 
> There are other gotchas if you process your query results other ways,
> so be sure you know exactly what is taking the time before worrying
> about the proper way to speed things up.
> 
> I strongly suspect that the RAMDir is a complete red herring. a 17M index
> will almost certainly be cached by the system after a bit of use.
> 
> There's a whole section up on the Lucene website that talks about various
> ways to speed up processing....
> 
> Measure, *then* optimize <G>......
> 
> Best
> Erick
> 
> On Wed, Aug 13, 2008 at 7:42 AM, Darren Govoni <da...@ontrenet.com> wrote:
> 
> > Hoss,
> >   Thank you for the detailed response. What I found weird was it
> > seemed to take 0.09 seconds to create a RAMDirectory off a 17MB index.
> > Suspiciously fast, but ok.
> >
> > Yet, when I do a simple fuzzy search on a single field
> >
> > "word: someword~0.76"
> >
> > It was taking .35 seconds. That's a very very long time all things
> > considered. I understand about the OS paging and such but in
> > doing some variations of this to "throw the OS off", I still saw
> > no difference between on-disk and RAM times. But despite that, the
> > times are really slow.
> >
> > Any ideas?
> >
> > thanks again,
> > Darren
> >
> > On Tue, 2008-08-12 at 19:55 -0700, Chris Hostetter wrote:
> > > : On one index, I am seeing no speed change when flipping between
> > > : RAMDirectory IndexSearcher and file system version.
> > >
> > > that is probably because even if you just use an FSDirectory, your OS
> > will
> > > cache the disk "pages" in RAM for you -- all using a RAMDirectory does
> > for
> > > you is garuntee that the entire index is copied into the heap you
> > allocate
> > > for your JVM.  If you've got 16GB or RAM, and a 5GB index, and you
> > > allocated 12GB of RAM to the JVM and read your index into a RAMDirectory,
> > > your index will always be in RAM, no matter what other processes do on
> > > your machine.
> > >
> > > If instead you only allocate 6GB of RAM to the JVM, and nothing else is
> > > using up the rest of your RAM, the OS has plenty to load the whole index
> > > into RAM as part of the filesystem cache once you use it -- but if
> > another
> > > process comes along and really needs that RAM (or if something reads a
> > lot
> > > of other pages of disk) your index might get bumped from the filesystem
> > > cache, and the next few reads could be slow.
> > >
> > > : Creating the RAMDirectory from the on-disk index only takes 0.09
> > > : seconds. It appears it is not loading the data into memory, but maybe
> > > : just the file names of the index?
> > >
> > > passing an FSDIrectory to the constructor of a RAMDIrectory uses the
> > > Directory.copy() method whose source is fairly straight forward and easy
> > > to read -- unless your index is ginormous it's not suprising that it's
> > > "fast" particularly if it's already in the filesystem cache.
> > >
> > >
> > >
> > >
> > > -Hoss
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: possible to read index into memory?

Posted by Erick Erickson <er...@gmail.com>.

How are you measuring? There is a bunch of setup work for the first
few queries that go through the system. In either case (RAM or FS),
you should fire a few representative warmup queries at the search
engine before you go ahead and measure the response time.

You also *must* isolate your search time from your response
assembly time. That is, if you have something like
Hits hits = search()
for (each element of hits) {
   do something with the hit
}

you MUST measure the time for the search() call exclusive of the
for loop before you know where to concentrate your efforts.

In this example, if you get more than 100 hits, your query is
actually re-executed every 100 times through the above loop.

There are other gotchas if you process your query results other ways,
so be sure you know exactly what is taking the time before worrying
about the proper way to speed things up.

I strongly suspect that the RAMDir is a complete red herring. a 17M index
will almost certainly be cached by the system after a bit of use.

There's a whole section up on the Lucene website that talks about various
ways to speed up processing....

Measure, *then* optimize <G>......

Best
Erick

On Wed, Aug 13, 2008 at 7:42 AM, Darren Govoni <da...@ontrenet.com> wrote:

> Hoss,
>   Thank you for the detailed response. What I found weird was it
> seemed to take 0.09 seconds to create a RAMDirectory off a 17MB index.
> Suspiciously fast, but ok.
>
> Yet, when I do a simple fuzzy search on a single field
>
> "word: someword~0.76"
>
> It was taking .35 seconds. That's a very very long time all things
> considered. I understand about the OS paging and such but in
> doing some variations of this to "throw the OS off", I still saw
> no difference between on-disk and RAM times. But despite that, the
> times are really slow.
>
> Any ideas?
>
> thanks again,
> Darren
>
> On Tue, 2008-08-12 at 19:55 -0700, Chris Hostetter wrote:
> > : On one index, I am seeing no speed change when flipping between
> > : RAMDirectory IndexSearcher and file system version.
> >
> > that is probably because even if you just use an FSDirectory, your OS
> will
> > cache the disk "pages" in RAM for you -- all using a RAMDirectory does
> for
> > you is garuntee that the entire index is copied into the heap you
> allocate
> > for your JVM.  If you've got 16GB or RAM, and a 5GB index, and you
> > allocated 12GB of RAM to the JVM and read your index into a RAMDirectory,
> > your index will always be in RAM, no matter what other processes do on
> > your machine.
> >
> > If instead you only allocate 6GB of RAM to the JVM, and nothing else is
> > using up the rest of your RAM, the OS has plenty to load the whole index
> > into RAM as part of the filesystem cache once you use it -- but if
> another
> > process comes along and really needs that RAM (or if something reads a
> lot
> > of other pages of disk) your index might get bumped from the filesystem
> > cache, and the next few reads could be slow.
> >
> > : Creating the RAMDirectory from the on-disk index only takes 0.09
> > : seconds. It appears it is not loading the data into memory, but maybe
> > : just the file names of the index?
> >
> > passing an FSDIrectory to the constructor of a RAMDIrectory uses the
> > Directory.copy() method whose source is fairly straight forward and easy
> > to read -- unless your index is ginormous it's not suprising that it's
> > "fast" particularly if it's already in the filesystem cache.
> >
> >
> >
> >
> > -Hoss
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: possible to read index into memory?

Posted by Darren Govoni <da...@ontrenet.com>.

Hoss,
   Thank you for the detailed response. What I found weird was it
seemed to take 0.09 seconds to create a RAMDirectory off a 17MB index.
Suspiciously fast, but ok.

Yet, when I do a simple fuzzy search on a single field 

"word: someword~0.76"

It was taking .35 seconds. That's a very very long time all things
considered. I understand about the OS paging and such but in
doing some variations of this to "throw the OS off", I still saw
no difference between on-disk and RAM times. But despite that, the
times are really slow. 

Any ideas?

thanks again,
Darren

On Tue, 2008-08-12 at 19:55 -0700, Chris Hostetter wrote:
> : On one index, I am seeing no speed change when flipping between
> : RAMDirectory IndexSearcher and file system version.
> 
> that is probably because even if you just use an FSDirectory, your OS will 
> cache the disk "pages" in RAM for you -- all using a RAMDirectory does for 
> you is garuntee that the entire index is copied into the heap you allocate 
> for your JVM.  If you've got 16GB or RAM, and a 5GB index, and you 
> allocated 12GB of RAM to the JVM and read your index into a RAMDirectory, 
> your index will always be in RAM, no matter what other processes do on 
> your machine.
> 
> If instead you only allocate 6GB of RAM to the JVM, and nothing else is 
> using up the rest of your RAM, the OS has plenty to load the whole index 
> into RAM as part of the filesystem cache once you use it -- but if another 
> process comes along and really needs that RAM (or if something reads a lot 
> of other pages of disk) your index might get bumped from the filesystem 
> cache, and the next few reads could be slow.
> 
> : Creating the RAMDirectory from the on-disk index only takes 0.09
> : seconds. It appears it is not loading the data into memory, but maybe
> : just the file names of the index?
> 
> passing an FSDIrectory to the constructor of a RAMDIrectory uses the 
> Directory.copy() method whose source is fairly straight forward and easy 
> to read -- unless your index is ginormous it's not suprising that it's 
> "fast" particularly if it's already in the filesystem cache.
> 
> 
> 
> 
> -Hoss
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: possible to read index into memory?

Posted by Chris Hostetter <ho...@fucit.org>.

: On one index, I am seeing no speed change when flipping between
: RAMDirectory IndexSearcher and file system version.

that is probably because even if you just use an FSDirectory, your OS will 
cache the disk "pages" in RAM for you -- all using a RAMDirectory does for 
you is garuntee that the entire index is copied into the heap you allocate 
for your JVM.  If you've got 16GB or RAM, and a 5GB index, and you 
allocated 12GB of RAM to the JVM and read your index into a RAMDirectory, 
your index will always be in RAM, no matter what other processes do on 
your machine.

If instead you only allocate 6GB of RAM to the JVM, and nothing else is 
using up the rest of your RAM, the OS has plenty to load the whole index 
into RAM as part of the filesystem cache once you use it -- but if another 
process comes along and really needs that RAM (or if something reads a lot 
of other pages of disk) your index might get bumped from the filesystem 
cache, and the next few reads could be slow.

: Creating the RAMDirectory from the on-disk index only takes 0.09
: seconds. It appears it is not loading the data into memory, but maybe
: just the file names of the index?

passing an FSDIrectory to the constructor of a RAMDIrectory uses the 
Directory.copy() method whose source is fairly straight forward and easy 
to read -- unless your index is ginormous it's not suprising that it's 
"fast" particularly if it's already in the filesystem cache.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: possible to read index into memory?

Posted by Kalani Ruwanpathirana <ka...@gmail.com>.

Did you try this?

byte [] buffer = new byte [100] ;
LuceneUtils.copy(fsDir, ramDir, buffer);


Kalani

On Wed, Aug 13, 2008 at 6:26 AM, Darren Govoni <da...@ontrenet.com> wrote:

> Hello,
>  The kind sir below recommended the RAMDirectory for loading an on-disk
> index into memory (the entire data) and using IndexSearcher off that. It
> seemed to worked very well.
>
> On one index, I am seeing no speed change when flipping between
> RAMDirectory IndexSearcher and file system version.
>
> Creating the RAMDirectory from the on-disk index only takes 0.09
> seconds. It appears it is not loading the data into memory, but maybe
> just the file names of the index?
>
> How can I load an on-disk index - the data - into memory and run
> searches there?
>
> thanks for any help. you guys are awesome!
> D
>
> On Thu, 2008-06-26 at 15:47 -0400, Erick Erickson wrote:
> > >From the docs...
> >
> > RAMDirectory
> >
> > public *RAMDirectory*(Directory
> > <file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/store/Directory.html>
> > dir)
> >              throws IOException
> > <http://java.sun.com/j2se/1.4/docs/api/java/io/IOException.html>
> >
> > Creates a new RAMDirectory instance from a different
> > Directoryimplementation. This can be used to load a disk-based index
> > into memory.
> >
> > Seems like exactly what you're asking for...
> >
> > Best
> > Erick
> >
> > On Thu, Jun 26, 2008 at 3:40 PM, Darren Govoni <da...@ontrenet.com>
> wrote:
> >
> > > Hi,
> > >  Is there a lucene index reader that will load a disk-based index into
> > > memory and perform searches on it from RAM? Sorry if I missed this in
> > > the docs somewhere.
> > >
> > > Darren
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa