You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Michael Stoppelman <st...@gmail.com> on 2008/11/29 18:45:38 UTC

indexing issue

Hi all,

I've got an indexing issue I think other folks might be interested in
hearing about and I wanted to get feedback before I went ahead and
implemented a new method.

Currently, the way we update indices is by sending individual delete/add
document requests to all our search boxes individually. Each box is doing
about 20-30qps while this is happening. The problem I'm seeing is that when
a segment from the index is merged [honestly I don't know that much about
segment merging] (our merge factor is set to 5) and an old highly used
segment of the index is lost from the disk cache; most of the search
requests to that box get prohibitively slow 10-80+ secs and I see pg/in +
pg/out stats spike sar. I'm planning on implementing a method similar to the
SOLR model using the rsync method that Doug Cutting outlined a long time ago
on this list and forcing the new files into the disk cache using fadvice.

Is there another strategy here? Could I create a merge policy that forces
new segments into the disk cache before lucene nukes the old ones?

Thanks,
M

Re: indexing issue

Posted by Michael Stoppelman <st...@gmail.com>.
On Sat, Nov 29, 2008 at 11:11 AM, Yonik Seeley <yo...@apache.org> wrote:

> On Sat, Nov 29, 2008 at 12:45 PM, Michael Stoppelman <st...@gmail.com>
> wrote:
> > Hi all,
> >
> > I've got an indexing issue I think other folks might be interested in
> > hearing about and I wanted to get feedback before I went ahead and
> > implemented a new method.
> >
> > Currently, the way we update indices is by sending individual delete/add
> > document requests to all our search boxes individually. Each box is doing
> > about 20-30qps while this is happening. The problem I'm seeing is that
> when
> > a segment from the index is merged [honestly I don't know that much about
> > segment merging] (our merge factor is set to 5) and an old highly used
> > segment of the index is lost from the disk cache; most of the search
> > requests to that box get prohibitively slow 10-80+ secs and I see pg/in +
> > pg/out stats spike sar.
>
> > I'm planning on implementing a method similar to the
> > SOLR model using the rsync method that Doug Cutting outlined a long time
> ago
> > on this list and forcing the new files into the disk cache using fadvice.
>
> FYI, if you don't actually want to use rsync, Solr has very recently
> implemented this in pure Java.
> But forcing new files into the cache means ejecting currently used
> files from the cache (for queries currently in flight).  Doesn't seem
> any way to really win here if you don't have enough memory to hold the
> critical parts of both indexes.


I found a rsync patch that has a flag that lets you tell the kernel to not
disk cache the new files being copied over (adds the --drop-caches flag).

http://lists.samba.org/archive/rsync/2007-May/017707.html


>
>
> I think Mike pointed out in a recent post that it would be nice to
> advise the kernel not to cache files being written during merging.  In
> Solr, we would want to allow the same thing when copying over new
> segment files during replication.  Unfortunately, I don't think there
> is a way to do this through Java.
>
> -Yonik
>
> > Is there another strategy here? Could I create a merge policy that forces
> > new segments into the disk cache before lucene nukes the old ones?
> >
> > Thanks,
> > M
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: indexing issue

Posted by Yonik Seeley <yo...@apache.org>.
On Sat, Nov 29, 2008 at 12:45 PM, Michael Stoppelman <st...@gmail.com> wrote:
> Hi all,
>
> I've got an indexing issue I think other folks might be interested in
> hearing about and I wanted to get feedback before I went ahead and
> implemented a new method.
>
> Currently, the way we update indices is by sending individual delete/add
> document requests to all our search boxes individually. Each box is doing
> about 20-30qps while this is happening. The problem I'm seeing is that when
> a segment from the index is merged [honestly I don't know that much about
> segment merging] (our merge factor is set to 5) and an old highly used
> segment of the index is lost from the disk cache; most of the search
> requests to that box get prohibitively slow 10-80+ secs and I see pg/in +
> pg/out stats spike sar.

> I'm planning on implementing a method similar to the
> SOLR model using the rsync method that Doug Cutting outlined a long time ago
> on this list and forcing the new files into the disk cache using fadvice.

FYI, if you don't actually want to use rsync, Solr has very recently
implemented this in pure Java.
But forcing new files into the cache means ejecting currently used
files from the cache (for queries currently in flight).  Doesn't seem
any way to really win here if you don't have enough memory to hold the
critical parts of both indexes.

I think Mike pointed out in a recent post that it would be nice to
advise the kernel not to cache files being written during merging.  In
Solr, we would want to allow the same thing when copying over new
segment files during replication.  Unfortunately, I don't think there
is a way to do this through Java.

-Yonik

> Is there another strategy here? Could I create a merge policy that forces
> new segments into the disk cache before lucene nukes the old ones?
>
> Thanks,
> M

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org