Posted to java-user@lucene.apache.org by tsuraan <ts...@gmail.com> on 2009/07/22 18:59:41 UTC

Batch searching

If I understand Lucene correctly, when doing multiple simultaneous
searches on the same IndexSearcher, they will basically all do their
own index scans and collect results independently.  If that's correct,
is there a way to batch searches together, so only one index scan is
done?  What I'd like is a Searcher.search(Query[], Collector[]) type
function, where the search only scans over the index once for each
collection of (basically unrelated) searches.  Is that possible, or
does that even make sense?
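For illustration, the batched API being asked about could look something like the sketch below. This is hypothetical, plain Java over a toy in-memory "index" (no real Lucene classes): a single scan over the documents serves a whole batch of unrelated single-term queries, with one result list per query playing the role of the Collector[].

```java
import java.util.*;

// Hypothetical sketch of a batched, single-pass search (NOT a Lucene API):
// one scan over a toy "index" collects hits for several unrelated queries.
public class BatchScan {
    public static List<List<Integer>> search(List<Set<String>> docs, String[] queries) {
        List<List<Integer>> hits = new ArrayList<>();
        for (int q = 0; q < queries.length; q++) hits.add(new ArrayList<>());
        // Single scan: each document is visited once for the whole batch.
        for (int docId = 0; docId < docs.size(); docId++) {
            for (int q = 0; q < queries.length; q++) {
                if (docs.get(docId).contains(queries[q])) hits.get(q).add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.<Set<String>>asList(
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("b", "c")),
            new HashSet<>(Arrays.asList("a", "c")));
        // Two unrelated queries answered in one pass over the documents.
        System.out.println(search(docs, new String[]{"a", "c"}));  // [[0, 2], [1, 2]]
    }
}
```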

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Batch searching

Posted by Shai Erera <se...@gmail.com>.
Queries cannot be ordered "sequentially". Let's say that you run 3 Queries,
w/ one term each "a", "b" and "c". On disk, the posting lists of the terms
can look like this: post1(a), post1(c), post2(a), post1(b), post2(c),
post2(b), etc. They are not guaranteed to be consecutive. The code makes sure
the posting list parts come logically one after the other, but
file-system-wise, they can be completely scattered on disk.

Besides, I'm not sure that what you want will get you anything. If you run 3
queries in parallel, you open 3 file descriptors (at least). Modern machines
can very well read 3 files in parallel, so you won't be seeking around as
much as you think.

I just don't think that, application-wise, there's much you can do, at
least in this case.
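A toy model of the interleaving described above, using the post1(a)...post2(b) layout from the example (plain Java, not Lucene internals): reading any one term's postings means touching non-consecutive positions in the file, even though the chunks are logically ordered.

```java
import java.util.*;

// Toy illustration: posting-list chunks interleaved on "disk", so reading one
// term's postings means seeking to scattered positions in the file.
public class PostingLayout {
    public static List<Integer> positionsFor(String term, String[] file) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < file.length; i++) {
            if (file[i].endsWith("(" + term + ")")) positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Physical layout from the example: chunks of a, b and c interleaved.
        String[] file = {"post1(a)", "post1(c)", "post2(a)", "post1(b)", "post2(c)", "post2(b)"};
        for (String term : new String[]{"a", "b", "c"}) {
            // e.g. "b -> seeks to positions [3, 5]"
            System.out.println(term + " -> seeks to positions " + positionsFor(term, file));
        }
    }
}
```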

Shai

On Wed, Jul 22, 2009 at 8:15 PM, tsuraan <ts...@gmail.com> wrote:

> > It's not accurate to say that Lucene scans the index for each search.
> > Rather, every Query reads a set of posting lists, each of which is typically
> > read from disk. If you pass a Query[] whose queries have nothing in common
> > (for example, no terms in common), then you won't gain anything, b/c each
> > Query will already read just the posting lists it needs.
>
> That sounds like a lot of disk seeking, if the terms associated with
> each query don't happen to fall in exact order.  My disks can sustain
> 100+ MB/s sequential read, but if they're seeking that number
> plummets.  Would it be possible to order the queries so that they can
> each read their index information in order, to minimize thrashing?
>
> > If your Query[] contains the same Query repeated, it's redundant to run all
> > these searches, since they will return the same results every time.
>
> I'm assuming that the queries being run are different.  Caching query
> results would be pretty easy for us though, so if the queries aren't
> different, they could be made to be.
>

Re: Batch searching

Posted by tsuraan <ts...@gmail.com>.
> It's not accurate to say that Lucene scans the index for each search.
> Rather, every Query reads a set of posting lists, each of which is typically
> read from disk. If you pass a Query[] whose queries have nothing in common
> (for example, no terms in common), then you won't gain anything, b/c each
> Query will already read just the posting lists it needs.

That sounds like a lot of disk seeking, if the terms associated with
each query don't happen to fall in exact order.  My disks can sustain
100+ MB/s sequential read, but if they're seeking that number
plummets.  Would it be possible to order the queries so that they can
each read their index information in order, to minimize thrashing?
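A minimal sketch of this ordering idea (the term-to-offset table below is a hypothetical stand-in; Lucene does not expose its files this way): sort the batch's term reads by file offset so the reads happen in one forward sweep instead of random seeks.

```java
import java.util.*;

// Sketch of ordering a batch's term reads by (hypothetical) file offset,
// so the disk is read in one sequential sweep rather than thrashing.
public class OrderedReads {
    public static List<String> readOrder(Map<String, Long> termOffsets, List<String> batchTerms) {
        List<String> ordered = new ArrayList<>(batchTerms);
        // Sort the terms the batch needs by where their data lives in the file.
        ordered.sort(Comparator.comparingLong(termOffsets::get));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Long> offsets = new HashMap<>();
        offsets.put("a", 4096L);
        offsets.put("b", 128L);
        offsets.put("c", 65536L);
        // Three unrelated queries asking for terms c, a, b, read in disk order:
        System.out.println(readOrder(offsets, Arrays.asList("c", "a", "b")));  // [b, a, c]
    }
}
```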

> If your Query[] contains the same Query repeated, it's redundant to run all
> these searches, since they will return the same results every time.

I'm assuming that the queries being run are different.  Caching query
results would be pretty easy for us though, so if the queries aren't
different, they could be made to be.
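A query-result cache along these lines can be as small as an LRU map. This is a plain-Java sketch with no Lucene dependency; the capacity and the query-string keys are assumptions for illustration.

```java
import java.util.*;

// Minimal LRU cache for query results, keyed by the query string.
// LinkedHashMap with accessOrder=true evicts the least recently used entry.
public class QueryCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder=true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // drop the LRU entry once over capacity
    }

    public static void main(String[] args) {
        QueryCache<String, List<Integer>> cache = new QueryCache<>(2);
        cache.put("title:foo", Arrays.asList(1, 2));
        cache.put("body:bar", Arrays.asList(3));
        cache.get("title:foo");                    // touch: now most recently used
        cache.put("body:baz", Arrays.asList(4));   // evicts "body:bar"
        System.out.println(cache.keySet());        // [title:foo, body:baz]
    }
}
```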



Re: Batch searching

Posted by Shai Erera <se...@gmail.com>.
It's not accurate to say that Lucene scans the index for each search.
Rather, every Query reads a set of posting lists, each of which is typically
read from disk. If you pass a Query[] whose queries have nothing in common
(for example, no terms in common), then you won't gain anything, b/c each
Query will already read just the posting lists it needs.

If your Query[] contains the same Query repeated, it's redundant to run all
these searches, since they will return the same results every time.

Right?

Shai

On Wed, Jul 22, 2009 at 7:59 PM, tsuraan <ts...@gmail.com> wrote:

> If I understand Lucene correctly, when doing multiple simultaneous
> searches on the same IndexSearcher, they will basically all do their
> own index scans and collect results independently.  If that's correct,
> is there a way to batch searches together, so only one index scan is
> done?  What I'd like is a Searcher.search(Query[], Collector[]) type
> function, where the search only scans over the index once for each
> collection of (basically unrelated) searches.  Is that possible, or
> does that even make sense?
>

Re: Batch searching

Posted by Matthew Hall <mh...@informatics.jax.org>.
This was at least one of the threads that was bouncing around... I'm 
fairly sure there were others as well.

Hopefully it's worth the read to you ^^

http://www.opensubscriber.com/message/java-dev@lucene.apache.org/11079539.html

Phil Whelan wrote:
> On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall<mh...@informatics.jax.org> wrote:
>   
>> Not sure if this helps you, but some of the issues you are facing seem
>> similar to those in the "real time" search threads.
>>     
>
> Hi Matthew,
>
> Do you have a pointer of where to go to see the "real time" threads?
>
> Thanks,
> Phil
>


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





Re: Batch searching

Posted by Phil Whelan <ph...@gmail.com>.
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall<mh...@informatics.jax.org> wrote:
> Not sure if this helps you, but some of the issues you are facing seem
> similar to those in the "real time" search threads.

Hi Matthew,

Do you have a pointer of where to go to see the "real time" threads?

Thanks,
Phil



Re: Batch searching

Posted by Matthew Hall <mh...@informatics.jax.org>.
Not sure if this helps you, but some of the issues you are facing seem
similar to those in the "real time" search threads.

Basically their problem involves indexing Twitter and the blogosphere,
and making Lucene work for super-large data sets like that.

Perhaps some of the discussion in those threads could help.  I'd imagine 
they went over things like massively distributed searches and such, and 
other things that might be of interest to you.

Sorry I can't be of more help than that.

Matt

tsuraan wrote:
>> Out of curiosity, what is the size of your corpus?  How much and how
>> quickly do you expect it to grow?
>>     
>
> In terms of Lucene documents, we tend to have counts in the 10M-100M range.
> Currently we use merging to make larger indices from smaller ones, so
> a single index can have a lot of documents in it, but merging takes a
> long time so I'm trying to test out just using a ton of tiny indices
> to see if the search penalty from doing that is worth the time savings
> from not having to build and optimize large indices.
>
>   
>> I'm just trying to make sure that we are all on the same page here ^^
>>
>> I can see the benefits of doing what you are describing with a very
>> large corpus that is expected to grow at quick rate, but if that's not
>> really your use case, then perhaps it might be worth investigating if a
>> simpler solution would serve you just as well.
>>     
>
> The indices also grow pretty quickly.  We have some cases where we get
> nearly a million new documents per day.  I haven't looked at those
> machines for quite a while, but I guess they'd probably have well over
> a hundred million documents, and are still growing.  We also don't
> have a lot of simultaneous searches yet, but that's changing, so I'm
> getting concerned about how well that's being handled.  We expect that
> we will soon be dealing with tens to hundreds of searches being
> executed simultaneously.
>
>   
>> In the example you provided, you are only talking about searching
>> against 1M documents, which I can guarantee will search with VERY good
>> performance in a single properly set up Lucene index.
>>
>> Now if we are talking more on the order of... 100M or more documents you
>> may be onto something.
>>     
>
> But in any case, there isn't currently any framework for making
> multiple searches simultaneously use an index in a coordinated
> fashion.  I was pretty much just planning on adding it to my tests if
> such a thing existed.  Since it doesn't, I guess I'll stick to
> searching in parallel and hoping that the Linux VFS layer is smart
> enough to keep things fast until I have time to try putting something
> together myself.
>


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





Re: Batch searching

Posted by tsuraan <ts...@gmail.com>.
> Out of curiosity, what is the size of your corpus?  How much and how
> quickly do you expect it to grow?

In terms of Lucene documents, we tend to have counts in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, but merging takes a
long time so I'm trying to test out just using a ton of tiny indices
to see if the search penalty from doing that is worth the time savings
from not having to build and optimize large indices.

> I'm just trying to make sure that we are all on the same page here ^^
>
> I can see the benefits of doing what you are describing with a very
>> large corpus that is expected to grow at a quick rate, but if that's not
> really your use case, then perhaps it might be worth investigating if a
> simpler solution would serve you just as well.

The indices also grow pretty quickly.  We have some cases where we get
nearly a million new documents per day.  I haven't looked at those
machines for quite a while, but I guess they'd probably have well over
a hundred million documents, and are still growing.  We also don't
have a lot of simultaneous searches yet, but that's changing, so I'm
getting concerned about how well that's being handled.  We expect that
we will soon be dealing with tens to hundreds of searches being
executed simultaneously.

> In the example you provided, you are only talking about searching
> against 1M documents, which I can guarantee will search with VERY good
> performance in a single properly set up Lucene index.
>
> Now if we are talking more on the order of... 100M or more documents you
> may be onto something.

But in any case, there isn't currently any framework for making
multiple searches simultaneously use an index in a coordinated
fashion.  I was pretty much just planning on adding it to my tests if
such a thing existed.  Since it doesn't, I guess I'll stick to
searching in parallel and hoping that the Linux VFS layer is smart
enough to keep things fast until I have time to try putting something
together myself.
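The parallel fallback described here can be sketched with a plain thread pool. The per-query search below is a hypothetical stand-in; in a real version each task would run its Query against an IndexSearcher and collect hits.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of "search in parallel": one task per query submitted to a thread
// pool, results gathered as the tasks complete. fakeSearch is a stand-in
// for a real per-query Lucene search.
public class ParallelSearch {
    public static Map<String, List<Integer>> searchAll(List<String> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<List<Integer>>> futures = new LinkedHashMap<>();
            for (String q : queries) {
                futures.put(q, pool.submit(() -> fakeSearch(q)));  // one task per query
            }
            Map<String, List<Integer>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<Integer>>> e : futures.entrySet()) {
                try {
                    results.put(e.getKey(), e.getValue().get());  // blocks until done
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException(ex);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for a real per-query search; returns a deterministic doc list.
    static List<Integer> fakeSearch(String query) {
        return Arrays.asList(query.length(), query.length() + 1);
    }

    public static void main(String[] args) {
        System.out.println(searchAll(Arrays.asList("a", "bb"), 2));  // {a=[1, 2], bb=[2, 3]}
    }
}
```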



Re: Batch searching

Posted by Matthew Hall <mh...@informatics.jax.org>.
Out of curiosity, what is the size of your corpus?  How much and how 
quickly do you expect it to grow?

I'm just trying to make sure that we are all on the same page here ^^

I can see the benefits of doing what you are describing with a very 
large corpus that is expected to grow at a quick rate, but if that's not 
really your use case, then perhaps it might be worth investigating if a 
simpler solution would serve you just as well.

In the example you provided, you are only talking about searching 
against 1M documents, which I can guarantee will search with VERY good 
performance in a single properly set up Lucene index.

Now if we are talking more on the order of... 100M or more documents you 
may be onto something.

Well, those are my thoughts anyhow.

Matt

tsuraan wrote:
>> If you did this, wouldn't you be binding the processing of the results
>> of all queries to that of the slowest performing one within the collection?
>>     
>
> I would imagine it would, but I haven't seen too much variance between
> Lucene query speeds in our data.
>
>   
>> I'm guessing you are trying for some sort of performance benefit by
>> batch processing, but I question whether or not you will actually get
>> more performance by performing your queries in a threaded type
>> environment, and then processing their results as they come in.
>>
>> Could you give a bit more description of what you are actually trying
>> to accomplish? I'm sure this list could help better if we had more
>> information.
>>     
>
> What I'd like to do is build lots of small indices (a few thousand
> documents per index) and put them into HDFS for search distribution.
> We already have our own map-reduce framework for searching, but HDFS
> seems to be a really good fit for an actual storage mechanism.
>
> My concern is that when we have one searcher using thousands of
> HDFS-backed indices, the seeking might get a bit nasty. HDFS
> apparently has pretty good seeking performance, but it really looks
> like it was designed for streaming, so if I could make my searches use
> sequential index access, I would expect better performance than having
> a ton of simultaneous searches making HDFS seek all over the place.
>


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





Re: Batch searching

Posted by tsuraan <ts...@gmail.com>.
> If you did this, wouldn't you be binding the processing of the results
> of all queries to that of the slowest performing one within the collection?

I would imagine it would, but I haven't seen too much variance between
Lucene query speeds in our data.

> I'm guessing you are trying for some sort of performance benefit by
> batch processing, but I question whether or not you will actually get
> more performance by performing your queries in a threaded type
> environment, and then processing their results as they come in.
>
> Could you give a bit more description of what you are actually trying
> to accomplish? I'm sure this list could help better if we had more
> information.

What I'd like to do is build lots of small indices (a few thousand
documents per index) and put them into HDFS for search distribution.
We already have our own map-reduce framework for searching, but HDFS
seems to be a really good fit for an actual storage mechanism.

My concern is that when we have one searcher using thousands of
HDFS-backed indices, the seeking might get a bit nasty. HDFS
apparently has pretty good seeking performance, but it really looks
like it was designed for streaming, so if I could make my searches use
sequential index access, I would expect better performance than having
a ton of simultaneous searches making HDFS seek all over the place.



Re: Batch searching

Posted by Matthew Hall <mh...@informatics.jax.org>.
If you did this, wouldn't you be binding the processing of the results 
of all queries to that of the slowest performing one within the collection?

I'm guessing you are trying for some sort of performance benefit by 
batch processing, but I question whether or not you will actually get 
more performance by performing your queries in a threaded type 
environment, and then processing their results as they come in.

Could you give a bit more description of what you are actually trying
to accomplish? I'm sure this list could help better if we had more
information.

Matt

tsuraan wrote:
> If I understand Lucene correctly, when doing multiple simultaneous
> searches on the same IndexSearcher, they will basically all do their
> own index scans and collect results independently.  If that's correct,
> is there a way to batch searches together, so only one index scan is
> done?  What I'd like is a Searcher.search(Query[], Collector[]) type
> function, where the search only scans over the index once for each
> collection of (basically unrelated) searches.  Is that possible, or
> does that even make sense?
>


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


