Posted to java-user@lucene.apache.org by "Kevin A. Burton" <bu...@newsmonster.org> on 2004/03/11 11:29:18 UTC

Real time indexing and distribution to lucene on separate boxes (long)

I'm curious to find out what others are doing in this situation.

I have two boxes... the indexer and the searcher.  The indexer is taking
documents and indexing them and creating indexes in a RAMDirectory (for
efficiency) and is then writing these indexes to disk as we begin to run
out of memory.  Usually these aren't very big... 15->100M or so.

Obviously I'm dividing the indexing and searching onto dedicated boxes to
improve efficiency.  The real issue though is that the searchers need to
be live all the time as indexes are being added at runtime.

So, in case that wasn't clear: I actually have to push out fresh indexes
WHILE users are searching them.  Not a very easy thing to do.

Here's my question: what is the optimal way to distribute these index
segments to the secondary searcher boxes?  I don't want to use the
MultiSearcher because it's slow once we have too many indexes (see my PS).

Here's what I'm currently thinking:

1.  Have the indexes sync'd to the searcher as shards directly.  This
doesn't scale as I would have to use the MultiSearcher which is slow when
it has too many indexes.  (And ideally we would want an optimized index).

2. Merge everything into one index on the indexer.  Lock the searcher,
then copy over the new index via rsync.  The problem here is that the
searcher would need to lock up while the sync is happening to prevent
reads on the index.  If I do this enough and the system is optimized I
think I would only have to block for 5 seconds or so but that's STILL
very long.

3. Have two directories on the searcher.  The indexer would then sync to
a tmp directory and then at run time swap them via a rename once the sync
is over.  The downside here is that this will take up 2x disk space on
the searcher.  The upside is that the box will only slow down while the
rsync is happening.  (There's a rough sketch of how I picture the swap
below.)

4. Do a LIVE index merge on the production box.  This might be an
interesting approach.  The major question I have is whether you can do an
optimize/merge on an index that's currently being used.  I *think* it
might be possible but I'm not sure.  This isn't as fast as performing the
merge on the indexer beforehand but it does have the benefits of both
worlds.

If anyone has any other ideas I would be all ears...
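
Here's a rough sketch of the option 3 swap on the searcher side that I
mention above.  The directory names (index-live, index-incoming,
index-old) and the IndexSwapper class are made up, and it assumes the
rsync into the incoming directory has already finished:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch only: directory names are illustrative.  The indexer rsyncs a
    // complete new index into "index-incoming"; we rename directories and
    // reopen, so queries never stop being served.
    public class IndexSwapper {

        private IndexSearcher current;                      // all queries go through this
        private final File live = new File("index-live");
        private final File incoming = new File("index-incoming");
        private final File old = new File("index-old");

        public synchronized IndexSearcher getSearcher() {
            return current;
        }

        public synchronized void swap() throws IOException {
            deleteRecursively(old);                // drop the previous generation
            live.renameTo(old);                    // keep the old index around briefly
            incoming.renameTo(live);               // new index takes its place

            IndexSearcher fresh = new IndexSearcher(live.getPath());
            IndexSearcher stale = current;
            current = fresh;                       // new queries hit the new index

            // In practice you'd wait for in-flight queries to finish (or
            // reference-count the searcher) before closing the old one.
            if (stale != null) stale.close();
        }

        private static void deleteRecursively(File dir) {
            File[] files = dir.listFiles();
            if (files != null)
                for (int i = 0; i < files.length; i++) deleteRecursively(files[i]);
            dir.delete();
        }
    }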

PS. Random question: the performance of the MultiSearcher is M log(N),
correct?  Where N is the number of documents in the index and M is the
number of indexes?  Is this about right?

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Dror Matalon wrote:

>To clarify how option 3 works:
>
>You have dira where the search is done and dirb where the indexing is
>done. dirb grows when you add new items to it, and at some point you
>swap and dirb becomes dira, but what do you do then?
>  
>
The Searcher reloads and points to dira...

>Also, how do you write from the indexer to the directory on the search box?
>  
>
We rsync the content over...

>2. The index is NFS mounted. The indexer keeps writing to the index and,
>at defined times, creates an NFS snapshot of the index. It then creates
>an entry in a db to let the searcher know that a new snapshot has been
>created.
>The searcher checks the db once a minute to see if there's a new
>snapshot. If there is one, it opens the index in the new snapshot and
>swaps it for the old one. The code to do this is synchronized.
>
>The nice thing about this solution is that you don't have just one copy
>of the index, and yet you don't do any copying. But you need to use NFS
>and snapshots.
>  
>
Well... right now I'm thinking that if I can do a merge on the box with
< 200M per commit, it won't be too much of a burden on the searchers as
long as it happens at regular intervals.

First, though, I'm going to have to test this to make sure I can keep
doing queries and an index merge on the same box, with the merge
happening in a different process.
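
The test I have in mind is basically this sketch (the index path and the
query term are made up; in production the merge/optimize would run in a
separate process, but the question of whether an already-open searcher
survives it is the same):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    // Sketch only: "index-live" and the query term are placeholders.
    public class ConcurrentOptimizeTest {
        public static void main(String[] args) throws Exception {
            // Open a searcher first, as the production searcher would be.
            IndexSearcher before = new IndexSearcher("index-live");

            // Optimize/merge the same index while that searcher stays open.
            IndexWriter writer = new IndexWriter("index-live", new StandardAnalyzer(), false);
            writer.optimize();
            writer.close();

            // The old searcher keeps reading the files it already opened,
            // so this should still answer queries...
            Hits hits = before.search(new TermQuery(new Term("contents", "lucene")));
            System.out.println("old searcher: " + hits.length() + " hits");
            before.close();

            // ...and a freshly opened searcher sees the optimized index.
            IndexSearcher after = new IndexSearcher("index-live");
            System.out.println("new searcher: "
                + after.search(new TermQuery(new Term("contents", "lucene"))).length() + " hits");
            after.close();
        }
    }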

Going to send off an email about this in a minute :)

Kevin



Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by Dror Matalon <dr...@zapatec.com>.
To clarify how option 3 works:

You have dira where the search is done and dirb where the indexing is
done. dirb grows when you add new items to it, and at some point you
swap and dirb becomes dira, but what do you do then?

Also, how do you write from the indexer to the directory on the search box?

We had two different ideas on how to do this; we ended up implementing
option 2, but you need filesystem snapshot support to be able to do it.

1. Similar idea to what you were suggesting, but optimizing the index:
	1. The searcher reads from dira.
	2. The writer writes to dirb which contains only *new* documents that
	aren't in dira.
	3. Periodically the writer stops writing and merges/optimizes dira and
	dirb to dirc.
	4. When done, the writer removes dirb and starts writing new
	documents to it.
	5. The searcher notices that dirc is ready, renames dira to something
	else, and renames dirc to dira.


The main problem with this approach is that the merging/optimizing can
be slow on large indexes. Even on fast machines with fast disks merging
several Gigs takes a while.
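
Step 3 is basically just an addIndexes call; a rough sketch, using the
dira/dirb/dirc names from above and leaving out error handling:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Sketch of step 3: merge/optimize dira (current index) and dirb (new
    // documents) into a brand new dirc, which the searcher later renames.
    public class MergeToDirc {
        public static void main(String[] args) throws Exception {
            Directory dira = FSDirectory.getDirectory("dira", false);
            Directory dirb = FSDirectory.getDirectory("dirb", false);

            // true = create dirc from scratch
            IndexWriter writer = new IndexWriter("dirc", new StandardAnalyzer(), true);
            writer.addIndexes(new Directory[] { dira, dirb });  // one optimized index in dirc
            writer.close();
        }
    }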

2. The index is NFS mounted. The indexer keeps writing to the index and,
at defined times, creates an NFS snapshot of the index. It then creates
an entry in a db to let the searcher know that a new snapshot has been
created.
The searcher checks the db once a minute to see if there's a new
snapshot. If there is one, it opens the index in the new snapshot and
swaps it for the old one. The code to do this is synchronized.

The nice thing about this solution is that you don't have just one copy
of the index, and yet you don't do any copying. But you need to use NFS
and snapshots.
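
On the searcher side, option 2 boils down to something like this sketch.
How you record the "latest snapshot" pointer (a db row, a file, whatever)
is up to you; the names here are made up:

    import org.apache.lucene.search.IndexSearcher;

    // Sketch only: once a minute, look up the latest snapshot path; if it
    // changed, open it and swap it in.  lookupLatestSnapshotPath() is a
    // placeholder for the db check.
    public class SnapshotWatcher implements Runnable {

        private IndexSearcher current;
        private String currentPath;

        public synchronized IndexSearcher getSearcher() {
            return current;
        }

        private synchronized void swapTo(String newPath) throws Exception {
            IndexSearcher fresh = new IndexSearcher(newPath);  // open the new snapshot
            IndexSearcher stale = current;
            current = fresh;
            currentPath = newPath;
            if (stale != null) stale.close();  // ideally after in-flight queries drain
        }

        public void run() {
            while (true) {
                try {
                    String latest = lookupLatestSnapshotPath();
                    if (latest != null && !latest.equals(currentPath))
                        swapTo(latest);
                    Thread.sleep(60 * 1000);   // check once a minute
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        // Placeholder for however you expose the newest snapshot (e.g. a
        // "SELECT path FROM snapshots ORDER BY created DESC" type lookup).
        private String lookupLatestSnapshotPath() {
            return System.getProperty("latest.snapshot");
        }
    }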


Dror


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com



Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by Doug Cutting <cu...@apache.org>.
Kevin A. Burton wrote:
> 3. Have two directories on the searcher.  The indexer would then sync to
> a tmp directory and then at run time swap them via a rename once the sync
> is over.  The downside here is that this will take up 2x disk space on
> the searcher.  The upside is that the box will only slow down while the
> rsync is happening.

For maximal search performance, this is your best bet.  Disk space is 
cheap.  At some point all newly issued queries start going against the 
new index, and, pretty soon, you can close and delete the old index. 
But you never have to stop searching.  Disk i/o will spike a bit during 
the changeover, as the new working set is swapped in.

Note that, if folks are paging through results by re-querying (the 
standard method) and the index is updated then things can get funky. 
One approach is to hang onto the old index longer, to make this less 
likely.  In any case, you might want to add an index-id to the search 
parameters, so that, if a next-page is issued when the index is no 
longer there you can give some sort of "stale query" error.
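
By an index-id I mean something as simple as the sketch below: a counter
bumped at every changeover, handed out with the first page of results and
required back on next-page requests.  The class and names are
illustrative only:

    import org.apache.lucene.search.IndexSearcher;

    // Sketch only: tag each searcher generation with an id; a next-page
    // request that carries a stale id gets a "stale query" error.
    public class SearcherManager {

        private IndexSearcher current;
        private long indexId;               // bumped at every index changeover

        public synchronized void swapIn(IndexSearcher fresh) {
            current = fresh;
            indexId++;
        }

        public synchronized long getIndexId() {
            return indexId;                 // returned to the client with page one
        }

        public synchronized IndexSearcher searcherFor(long requestIndexId)
                throws StaleQueryException {
            // You could instead keep the previous generation around for a
            // while and serve next-page requests from it, as noted above.
            if (requestIndexId != indexId)
                throw new StaleQueryException("index has changed; please re-run the query");
            return current;
        }

        public static class StaleQueryException extends Exception {
            public StaleQueryException(String message) { super(message); }
        }
    }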

> PS. Random question: the performance of the MultiSearcher is M log(N),
> correct?  Where N is the number of documents in the index and M is the
> number of indexes?  Is this about right?

No.

The added cost of a MultiSearcher is mostly proportional to just M, the 
number of indexes.  The normal cost of searching is still mostly 
proportional to N, the number of documents.  So M+N would probably be 
more accurate.  There is a log(M) here and there, to, e.g., figure out 
which index a doc id belongs to, but I doubt these are significant.

The significant costs of a MultiSearcher over an IndexSearcher are that 
it adds more term dictionary reads (one per query term per index) and 
more seeks (also one per query term per index, or two if you're using 
phrases).

Doug



Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- "Kevin A. Burton" <bu...@newsmonster.org> wrote:
> Otis Gospodnetic wrote:
> 
> >I like option 3.  I've done it before, and it worked well.  I dealt
> >with very small indices, though, and if your indices are several tens
> >or hundred gigs, this may be hard for you.
> >
> >Option 4: search can be performed on an index that is being modified
> >(update, delete, insert, optimize).  You'd just have to make sure not
> >to recreate new IndexSearcher too frequently, if your index is being
> >modified often.  Just change it every X index modification or every X
> >minutes, and you'll be fine.
> >
> Right now I'm thinking about #4... Disk may be cheap but a fast RAID 10
> array with 100G twice isn't THAT cheap... That's the worst case

Yes, but not everything needs to be on a fast RAID (you probably are
using SCSI disks in RAID, which is what makes it expensive; RAID itself
requires only a RAID controller).
You could have a Searcher machine with a set of cheap EIDE disks, and
use those as copy-target disks, which are not searched.
Once you transfer your indices there, you copy them onto the fast SCSI
RAID disks.

> Also... since the new indexes are SO small (~100M) the merges would
> probably be easier on the machine than just doing a whole new write.  Of
> course it's hard to make that argument with a 100G RAID array, but we're
> using rsync to avoid shipping the full index over the network, so the
> CPU computation and network reads would slow things down.
> 
> The only way around this is to re-upload the whole 100G index, but even
> over gigabit ethernet this will take 15 minutes.  This doesn't scale
> as we add more searchers.

I wonder what happens if you try compressing the indices before copying
them over the network.
I wonder if it makes a difference whether you use compound vs.
traditional directories.
I wonder what the index size is if you use DbDirectory instead of
FSDirectory.

Otis




Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Otis Gospodnetic wrote:

>I like option 3.  I've done it before, and it worked well.  I dealt
>with very small indices, though, and if your indices are several tens
>or hundred gigs, this may be hard for you.
>
>Option 4: search can be performed on an index that is being modified
>(update, delete, insert, optimize).  You'd just have to make sure not
>to recreate new IndexSearcher too frequently, if your index is being
>modified often.  Just change it every X index modification or every X
>minutes, and you'll be fine.
>  
>
Right now I'm thinking about #4... Disk may be cheap but a fast RAID 10
array with 100G twice isn't THAT cheap... That's the worst case scenario,
of course, though most modern search clusters use cheap hardware...

Also... since the new indexes are SO small (~100M) the merges would
probably be easier on the machine than just doing a whole new write.  Of
course it's hard to make that argument with a 100G RAID array, but we're
using rsync to avoid shipping the full index over the network, so the CPU
computation and network reads would slow things down.

The only way around this is to re-upload the whole 100G index, but even
over gigabit ethernet this will take 15 minutes.  This doesn't scale as
we add more searchers.

Thanks for the feedback!  I think that now that I know optimize is safe
as long as I don't create a new reader, I'll be fine.  I do have to think
about how I'm going to handle search result navigation.

Kevin



Re: Real time indexing and distribution to lucene on separate boxes (long)

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I like option 3.  I've done it before, and it worked well.  I dealt
with very small indices, though, and if your indices are several tens
or hundred gigs, this may be hard for you.

Option 4: search can be performed on an index that is being modified
(update, delete, insert, optimize).  You'd just have to make sure not
to recreate new IndexSearcher too frequently, if your index is being
modified often.  Just change it every X index modification or every X
minutes, and you'll be fine.
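
Something like this sketch, for example (the refresh interval and index
path are made up):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch only: the index is modified in place, and we open a new
    // IndexSearcher every N minutes instead of per modification or query.
    public class RefreshingSearcher {

        private static final long MAX_AGE_MS = 5 * 60 * 1000;  // e.g. every 5 minutes

        private final String indexPath;
        private IndexSearcher searcher;
        private long openedAt;

        public RefreshingSearcher(String indexPath) throws IOException {
            this.indexPath = indexPath;
            this.searcher = new IndexSearcher(indexPath);
            this.openedAt = System.currentTimeMillis();
        }

        public synchronized IndexSearcher getSearcher() throws IOException {
            if (System.currentTimeMillis() - openedAt > MAX_AGE_MS) {
                IndexSearcher stale = searcher;
                searcher = new IndexSearcher(indexPath);  // picks up recent modifications
                openedAt = System.currentTimeMillis();
                stale.close();  // ideally after in-flight queries finish
            }
            return searcher;
        }
    }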

Otis

