Posted to solr-user@lucene.apache.org by Will Johnson <wj...@GETCONNECTED.COM> on 2007/05/10 15:37:46 UTC

fast update handlers

I'm trying to set up a system with very low index latency (1-2
seconds), and one of the javadocs intrigued me:

"DirectUpdateHandler2 implements an UpdateHandler where documents are
added directly to the main Lucene index as opposed to adding to a
separate smaller index"

The plain DirectUpdateHandler also had the same text in its docs.  Does
this imply that there used to be another handler that could send docs to
a small/faster index and then merge them into a larger one, or that
someone could write one in the future?  I read through a good bit of the
code and didn't see how it could be handled from a searcher perspective,
but perhaps I'm missing some key piece.

- will


Re: fast update handlers

Posted by Ryan McKinley <ry...@gmail.com>.
I don't know if this helps, but...

Do *all* your queries need to include the fast updates?  I have a setup
where some cases need the newest stuff but most cases can wait 5 mins
(or so).

In that case, I have two Solr instances pointing to the same index
files.  One is used for updates and for queries that need everything.
The other is a read-only instance that serves the majority of queries.

What is nice about this is that you can set different cache sizes and 
auto-warming for the different cases.

ryan


Will Johnson wrote:
> The problem is that I want newly added documents to become searchable
> every 1-2 seconds, so I need the commits.  I was hoping that the caches
> could be tied to the IndexSearcher, so that a MultiSearcher could take
> advantage of the multiple sub-indexes and their respective caches.
> 
> 
> I think the best approach now is to write a top-level federator that
> can merge results from the large, mostly static index and the smaller,
> more dynamic index.
> 
> - will


RE: fast update handlers

Posted by Will Johnson <wj...@GETCONNECTED.COM>.
The problem is that I want newly added documents to become searchable
every 1-2 seconds, so I need the commits.  I was hoping that the caches
could be tied to the IndexSearcher, so that a MultiSearcher could take
advantage of the multiple sub-indexes and their respective caches.


I think the best approach now is to write a top-level federator that
can merge results from the large, mostly static index and the smaller,
more dynamic index.
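
A minimal sketch of what such a federator's merge step might look like,
assuming each index returns its hits as (doc_id, score) pairs already
sorted by descending score.  The function name and the hit format are
hypothetical, and the sketch ignores the fact that scores computed
against two separate indexes are not strictly comparable (term
statistics such as IDF differ between them).

```python
import heapq
from itertools import islice


def federate(static_hits, dynamic_hits, rows=10):
    """Merge two hit lists into one top-`rows` list.

    Each input is a list of (doc_id, score) tuples that is already
    sorted by descending score, as a search result page would be.
    """
    merged = heapq.merge(
        static_hits,
        dynamic_hits,
        key=lambda hit: hit[1],  # compare on score only
        reverse=True,            # inputs are sorted high to low
    )
    # heapq.merge is lazy, so only the first `rows` hits are consumed.
    return list(islice(merged, rows))
```

Because the merge is lazy, the federator only needs the top `rows` hits
from each side; faceting across the two indexes would need separate
handling.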

- will




RE: fast update handlers

Posted by Charlie Jackson <Ch...@cision.com>.
What about issuing separate commits to the index on a regularly
scheduled basis? For example, you add documents to the index every 2
seconds, or however often, but these operations don't commit. Instead,
you have a cron'd script or something that just issues a commit every 5
or 10 minutes or whatever interval you'd like. 

I had to do something similar when I was running a re-index of my entire
dataset. My program wasn't issuing commits, so I just cron'd a commit
every half hour so that it didn't overload the server.

Thanks,
Charlie
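
A minimal sketch of the scheduled commit described above, assuming
Solr's XML update handler is reachable at
http://localhost:8983/solr/update (an assumed URL; adjust host, port,
and path for your install).  The function names are hypothetical.  The
script only posts Solr's <commit/> message; documents are added
elsewhere without committing.

```python
import urllib.request

# Assumed location of the Solr update handler; not taken from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_commit_request(update_url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying a <commit/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def send_commit(update_url=SOLR_UPDATE_URL, timeout=30):
    """Send the commit and return the HTTP status code."""
    request = build_commit_request(update_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron entry would then call send_commit() at whatever interval you
like.  The trade-off is freshness: documents added between commits are
not visible to searches until the next commit runs.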



Re: fast update handlers

Posted by Yonik Seeley <yo...@apache.org>.
On 5/10/07, Will Johnson <wj...@getconnected.com> wrote:
> I guess I was more concerned with doing the frequent commits and how
> that would affect the caches.  Say I have 2M docs in my main index but I
> want to add docs every 2 seconds, all while doing queries.  If I do
> commits every 2 seconds I basically lose any caching advantage and my
> faceting performance goes down the tubes.  If, however, I were to add
> things to a smaller index and then roll it into the larger one every ~30
> minutes, then I only take the hit on computing the larger filter caches
> on that interval.  Further, if my smaller index were based on a
> RAMDirectory instead of an FSDirectory, I assume computing the filter sets
> for the smaller index should be fast enough even every 2 seconds.

There isn't currently any support for incrementally updating filters.

-Yonik

RE: fast update handlers

Posted by Chris Hostetter <ho...@fucit.org>.
: want to add docs every 2 seconds, all while doing queries.  If I do
: commits every 2 seconds I basically lose any caching advantage and my
: faceting performance goes down the tubes.  If, however, I were to add
: things to a smaller index and then roll it into the larger one every ~30
: minutes, then I only take the hit on computing the larger filter caches

searching across both of these indexes (the big and the little) would
require something like a MultiReader, a way to unify DocSets
between the two, and the ability to cache on the sub indexes and on the
main MultiReader.
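
As a rough illustration of the "unify DocSets" step, the sketch below
maps per-index doc IDs into one global ID space by offsetting each
sub-index's IDs by the combined maxDoc of the sub-indexes before it,
which is how a MultiReader numbers documents.  The function name and
the use of plain Python sets are assumptions for illustration; this is
not Solr's DocSet API.

```python
def unify_docsets(sub_docsets, max_docs):
    """Combine per-sub-index doc ID sets into one global doc ID set.

    sub_docsets: one set of local doc IDs per sub-index, in reader order.
    max_docs:    the maxDoc of each sub-index, in the same order.
    """
    if len(sub_docsets) != len(max_docs):
        raise ValueError("need exactly one maxDoc per sub-index")

    unified = set()
    base = 0  # global ID of local doc 0 in the current sub-index
    for local_ids, max_doc in zip(sub_docsets, max_docs):
        for doc_id in local_ids:
            if not 0 <= doc_id < max_doc:
                raise ValueError(f"doc ID {doc_id} outside [0, {max_doc})")
            unified.add(base + doc_id)
        base += max_doc
    return unified
```

With this layout the big index's set could stay cached across commits
and only the small index's set would need recomputing, provided the big
index comes first so that its offsets never change.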

fortunately, a MultiReader is exactly what Lucene uses under the covers
when dealing with an FSDirectory, so we're halfway there.  something like
these might get us the rest of the way...

https://issues.apache.org/jira/browse/LUCENE-831
https://issues.apache.org/jira/browse/LUCENE-743




-Hoss


RE: fast update handlers

Posted by Will Johnson <wj...@GETCONNECTED.COM>.
I guess I was more concerned with doing the frequent commits and how
that would affect the caches.  Say I have 2M docs in my main index but I
want to add docs every 2 seconds, all while doing queries.  If I do
commits every 2 seconds I basically lose any caching advantage and my
faceting performance goes down the tubes.  If, however, I were to add
things to a smaller index and then roll it into the larger one every ~30
minutes, then I only take the hit on computing the larger filter caches
on that interval.  Further, if my smaller index were based on a
RAMDirectory instead of an FSDirectory, I assume computing the filter sets
for the smaller index should be fast enough even every 2 seconds.
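
The arithmetic behind that argument can be sketched as follows.  The
per-rebuild costs are made-up placeholder numbers, not measurements,
and the two-tier setup is the hypothetical one described above, not
something Solr supports out of the box.

```python
def filter_rebuild_cost_per_hour(big_cost_ms, small_cost_ms,
                                 commit_every_s=2, roll_every_s=1800):
    """Return (single_index_ms, two_tier_ms) spent rebuilding filters per hour.

    single index: every commit invalidates the caches of the big index.
    two tier:     every commit rebuilds only the small index's filters;
                  the big index's filters are rebuilt once per roll-up.
    """
    commits_per_hour = 3600 // commit_every_s
    rolls_per_hour = 3600 // roll_every_s
    single_index_ms = commits_per_hour * big_cost_ms
    two_tier_ms = rolls_per_hour * big_cost_ms + commits_per_hour * small_cost_ms
    return single_index_ms, two_tier_ms
```

With 2-second commits there are 1800 cache rebuilds per hour on the big
index, versus 2 per hour if it is only touched every 30 minutes.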

- will





Re: fast update handlers

Posted by Yonik Seeley <yo...@apache.org>.
On 5/10/07, Will Johnson <wj...@getconnected.com> wrote:
> I'm trying to set up a system with very low index latency (1-2
> seconds), and one of the javadocs intrigued me:
>
> "DirectUpdateHandler2 implements an UpdateHandler where documents are
> added directly to the main Lucene index as opposed to adding to a
> separate smaller index"
>
>
> The plain DirectUpdateHandler also had the same text in its docs.  Does
> this imply that there used to be another handler that could send docs to
> a small/faster index and then merge them into a larger one, or that
> someone could write one in the future?

That was the original design, before I thought of the current method
in DUH2. DirectUpdateHandler was just meant to get things working to
establish the external interface (it's only for testing... very slow
at overwriting docs).

Adding documents to a separate index and then merging would have no
real indexing speed advantage (it's essentially what Lucene does
anyway when adding to a large index).  There would be some advantage
for index distribution, but it would complicate things greatly.

High latency is caused by segment merges... this would happen when you
periodically had to merge the smaller index into the larger anyway.
You could do some other tricks for more predictable index times... set
a large mergeFactor and then call optimize after you have added your
batch of documents.
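
To make the mergeFactor trade-off concrete, here is a toy model of
level-based segment merging.  It is a simplification for illustration
only, not Lucene's actual merge policy: every flush creates one small
segment, and whenever merge_factor segments pile up at a level they are
merged into one segment at the next level.  The function name and
parameters are hypothetical.

```python
def simulate_merges(num_flushes, merge_factor, docs_per_flush=1):
    """Return (segments_left, docs_rewritten_by_merges) for a toy index."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")

    levels = [[]]   # levels[i] holds the sizes of the segments at level i
    rewritten = 0   # total docs copied by merges (a proxy for merge latency)

    for _ in range(num_flushes):
        levels[0].append(docs_per_flush)
        level = 0
        # Cascade: a merge may fill up the next level and trigger another.
        while len(levels[level]) >= merge_factor:
            merged_size = sum(levels[level])
            rewritten += merged_size
            levels[level] = []
            if level + 1 == len(levels):
                levels.append([])
            levels[level + 1].append(merged_size)
            level += 1

    segments_left = sum(len(segments) for segments in levels)
    return segments_left, rewritten
```

In this model, 100 flushes with merge_factor=10 end with 1 segment and
200 docs rewritten by merges that interrupt indexing along the way,
while merge_factor=1000 triggers no merges at all and leaves 100
segments for a single optimize at a time of your choosing.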

Stay tuned though... there has been some work on a Lucene patch to do
merging in a background thread.

-Yonik