Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/12/08 10:04:32 UTC

Re: Lucene performance bottlenecks

(Moving the discussion to nutch-dev, please drop the cc: when responding)

Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> It's nice to have these couple percent... however, it doesn't solve 
>> the main problem; I need 50 or more percent increase... :-) and I 
>> suspect this can be achieved only by some radical changes in the way 
>> Nutch uses Lucene. It seems the default query structure is too 
>> complex to get a decent performance.
>
>
> That would certainly help.
>
> For what it's worth, the Internet Archive has ~10M page Nutch indexes 
> that perform adequately.  See:
>
> http://websearch.archive.org/katrina/


Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when 
for any query the response time is well below 1 second. Otherwise the 
service seems sluggish. Response times over 3 seconds are normally not 
acceptable. This is just for a single concurrent query - the number of 
concurrent queries will be a function of the number of concurrent users 
and of the search response time, until it hits the limit on the number 
of threads on the search servers. From there, the response time gives 
the maximum concurrent queries-per-second estimate: e.g., 50 search 
threads at 1.5 seconds per query cap throughput at roughly 33 queries 
per second.

There is a total of 8,435,793 pages in that index. Here's a short list 
of queries, the number of matching pages, and the average time (I made 
just a couple of tests, no stress-loading ;-) )

* hurricane: 1,273,805 pages, 1.75 seconds
* katrina: 1,267,240 pages, 1.76 seconds
* gov: 979,820 pages, 1.01 seconds
* hurricane katrina: 773,001 pages, 3.5 seconds (!)
* "hurricane katrina": 600,867 pages, 1.35 seconds
* disaster relief: 205,066 pages, 1.12 seconds
* "disaster relief": 140,007 pages, 0.42 seconds
* hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
* "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
* xys: 227 pages, 0.01 seconds
* xyz: 3,497 pages, 0.005 seconds

>
> The performance is about what you report, but it is quite usable. 
> (Please don't stress-test this server!)  We recently built a ~100M 
> page Nutch index at the Internet Archive that is surprisingly usable 
> on a single CPU.  (This is not yet publicly accessible.)


What I found out is that "usable" depends a lot on how you test it and 
what your minimum expectation is. There are some high-frequency terms 
(and by this I mean terms with a frequency around 25%) that will 
consistently cause a dramatic slowdown. Multi-term queries, because of 
the way Nutch expands them into sloppy phrases, may take even more time, 
so even for such a relatively small index (from the POV of the whole 
Internet!) the response time may drag out to several seconds (try "com").
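
To make the cost concrete, here is a rough sketch of how a two-term 
query like "hurricane katrina" fans out (Lucene 1.4-era API; this is an 
illustration only, with a simplified field list and clause structure, 
not Nutch's exact translation):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // A term clause per term per field, plus a sloppy phrase per field -
    // the phrases are the expensive part.
    static Query expand(String t1, String t2) {
      BooleanQuery q = new BooleanQuery();
      String[] fields = { "url", "anchor", "content", "title", "host" };
      for (int i = 0; i < fields.length; i++) {
        q.add(new TermQuery(new Term(fields[i], t1)), false, false);
        q.add(new TermQuery(new Term(fields[i], t2)), false, false);
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term(fields[i], t1));
        phrase.add(new Term(fields[i], t2));
        phrase.setSlop(20);  // SLOP in BasicQueryFilter; default Integer.MAX_VALUE
        q.add(phrase, false, false);
      }
      return q;
    }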

>
> Perhaps your traffic will be much higher than the Internet Archive's, 
> or you have contractual obligations that specify certain average query 
> performance, but, if not, ~10M pages is quite searchable using Nutch 
> on a single CPU.


I'm not concerned about the traffic - I believe the distributed search 
can handle a lot of traffic if need be. What I'm concerned about is the 
maximum response time from individual search servers. This is because 
the front-end response time is determined by the longest response time 
from any of the (active) search servers. Response times over 1 sec. from 
a 10 mln collection are IMHO not adequate, because the service will 
appear slow. Response times over several seconds would mean that users 
would say goodbye and never return... ;-)

If 10 mln docs is too much for a single server to meet such a 
performance target, then this explodes the total number of servers 
required to handle Internet-wide collections of billions of pages...

So, I think it's time to re-think the query structure and scoring 
mechanisms, in order to simplify the Lucene queries generated by Nutch - 
or to do some other tricks...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is 
>> when for any query the response time is well below 1 second. 
>> Otherwise the service seems sluggish. Response times over 3 seconds 
>> are normally not acceptable.
>
>
> It depends.  Clearly an average response time of less than 1 second is 
> better than an average response time of 3 seconds.  There is no 
> argument.  That is a more useful search engine.  But a search engine 
> with a 3-second average response time is still much better than no 
> search engine at all.  If an institution cannot afford to guarantee 1 
> second average response time, must it not run a search engine?  For 
> low-traffic, non-commercial search engines, sluggishness is not a 
> fatal fault.

[...]

Yes, I fully agree with your arguments here - please accept my apologies 
if I came across as whining or complaining about that particular 
installation - quite the contrary, I think it's a unique and useful service.

My point was how to improve Nutch response time for large collections 
and for commercial situations, where the service you offer has to meet 
demanding requirements for maximum response time... the response times 
from this particular installation served just as an example of the issue.

>> There is a total of 8,435,793 pages in that index. Here's a short 
>> list of queries, the number of matching pages, and the average time 
>> (I made just a couple of tests, no stress-loading ;-) )
>>
>> * hurricane: 1,273,805 pages, 1.75 seconds
>> * katrina: 1,267,240 pages, 1.76 seconds
>> * gov: 979,820 pages, 1.01 seconds
>
>
> These are some of the slowest terms in this index.
>
>> * hurricane katrina: 773,001 pages, 3.5 seconds (!)
>
>
> This is not a very interesting query for this collection...


That's not the point - the point is that this is a valid query that 
users may enter, and the search engine should be prepared to return 
results within certain acceptable time limits.


>>  even more time, so even for such a relatively small index (from the 
>> POV of the whole Internet!) the response time may drag out to several 
>> seconds (try "com").
>
>
> How often do you search for "com"?
>

Ugh... Again, that's beside the point. It's a valid query, and a simple 
one at that, and the response time was awful.

>> Response times over several seconds would mean that users would say 
>> goodbye and never return... ;-)
>
>
> So, tell me, where will these users then search for archived web 
> content related to hurricane katrina?  There is no other option.  If 
> this were a competitive commercial offering, then some sluggishness 
> would indeed be unacceptable, and ~10M pages in a single index might 
> be too many on today's processors.  But in a non-profit unique 
> offering, I don't think this is unacceptable.  Not optimal, but 
> workable.  Should the archive refuse to make this content searchable 
> until they have faster or more machines, or until Nutch is faster?  I 
> don't think so.
>

I hope I explained that I didn't complain about this particular 
installation. I just used this installation to illustrate the problem 
that I also see in other installations, where the demands are much 
higher and much more difficult to meet.


>> If 10 mln docs is too much for a single server to meet such a 
>> performance target, then this explodes the total number of servers 
>> required to handle Internet-wide collections of billions of pages...
>>
>> So, I think it's time to re-think the query structure and scoring 
>> mechanisms, in order to simplify the Lucene queries generated by 
>> Nutch - or to do some other tricks...
>
>
> I think "other tricks" will be more fruitful.  Lucene is pretty well 
> optimized, and I don't think qualitative improvements can be had by 
> simplifying the queries without substantially reducing their 
> effectiveness.
>
> The trick that I think would be most fruitful is something like what 
> Torsten Suel describes in his paper titled "Optimized Query Execution 
> in Large Search Engines with Global Page Ordering".
>
> http://cis.poly.edu/suel/papers/order.pdf
> http://cis.poly.edu/suel/talks/order-vldb.ppt
>
> I believe all of the major search engines implement something like 
> this, where heuristics are used to avoid searching the complete 
> index.  (We certainly did so at Excite.)  The results are no longer 
> guaranteed to always be the absolute highest-scoring, but in most 
> cases are nearly identical.
>
> Implementing something like this for Lucene would not be too 
> difficult.  The index would need to be re-sorted by document boost: 
> documents would be re-numbered so that highly-boosted documents had 
> low document numbers.  Then a HitCollector can simply stop searching 
> once a given number of hits are found.


Now we are talking .. ;-) This sounds relatively simple and worth 
trying. Thanks for the pointers!
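
For the record, a minimal sketch of what such an early-terminating 
collector could look like (hypothetical code, Lucene 1.4-era API - not 
the actual patch discussed later in this thread):

    import org.apache.lucene.search.HitCollector;

    // Sketch only: stop after maxHits hits, assuming the index is sorted
    // so that low document numbers carry the highest boosts.
    public class LimitedCollector extends HitCollector {
      // one throw per query is cheap, if arguably bad style
      public static class LimitExceeded extends RuntimeException {}

      private final int maxHits;
      private int count = 0;

      public LimitedCollector(int maxHits) { this.maxHits = maxHits; }

      public void collect(int doc, float score) {
        // ...record (doc, score) in a priority queue here...
        if (++count >= maxHits)
          throw new LimitExceeded();  // the rest can only have lower boosts
      }
    }

The search call then catches LimitExceeded and returns whatever was 
collected so far.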

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Doug Cutting wrote:
>>
>>> The graph just shows that they differ, not how much better or worse 
>>> they are, since the baseline is not perfect.  When the top-10 is 50% 
>>> different, are those 5 different hits markedly worse matches to your 
>>> eye than the five they've displaced, or are they comparable?  That's 
>>> what really matters.
>>
>>
>> Hmm. I'm not sure I agree with this. Your reasoning would be true if 
>> we were changing the ranking formula. But the goal IMHO with these 
>> patches is to return equally complete results, using the same ranking 
>> formula.
>
>
> But we should not assume that the ranking formula is perfect.  Imagine 
> a case where the high-order bits of the score are correct and the 
> low-order bits are random.  Then an optimization which changes local 
> orderings does not actually affect result quality.


Yes, that's true, I could accept that. In these tests the score delta 
was something like 20 (between hit #1 and hit #100), and the score 
fell off rapidly after the first 10 or 20 results. Now, the problem is 
that many results _within_ this range (i.e. still within the area with 
large score deltas) were missing. And this suggests that the 
differences were also in the high-order bits.

Please re-run the script on your index, using typical queries, and check 
the results. It's possible that I made a mistake somewhere; it would be 
good to confirm at least the trends in the raw results.

>
>> I specifically avoided using normalized scores, instead using the 
>> absolute scores in TopDocs. And the absolute scores in both cases are 
>> exactly the same, for those results that are present.
>>
>> What is wrong is that some results that should be there (judging by 
>> the ranking) are simply missing. So, it's about the recall, and the 
>> baseline index gives the best estimate.
>
>
> Yes, this optimization, by definition, hurts recall.  The only 
> question is does it substantially hurt relevance at, e.g., 10 hits.  
> If the top-10 are identical then the answer is easy: no, it does not.  
> But if they differ, we can only answer this by looking at results.  
> Chances are they're worse, but how much?  Radically?  Slightly?  
> Noticeably?


The paper by Suel et al. which you referred to claims top-100 accuracy 
as high as 98% after optimizations. What I observed were values between 
0 and 60%, and going above this level caused a heavy performance loss.

>
>>> What part of Nutch are you trying to avoid?  Perhaps you could try 
>>> measuring your Lucene-only benchmark against a Nutch-based one.  If 
>>> they don't differ markedly then you can simply use Nutch, which 
>>> makes it a stronger benchmark.  If they differ, then we should 
>>> figure out why.
>>
>>
>> Again, I don't see it this way. Nutch results will always be worse 
>> than pure Lucene, because of the added layers. If I can't improve the 
>> performance in the Lucene code (which takes > 85% of the time for 
>> every query) then no matter how well optimized the Nutch code is it 
>> won't get far.
>
>
> But we're mostly modifying Nutch's use of Lucene, not modifying 
> Lucene.  So measuring Lucene alone won't tell you everything, and 
> you'll keep having to port Nutch stuff.  If you want to, e.g., replay 
> a large query log to measure average performance, then you'll need 
> things like auto-filterization, n-grams, query plugins, etc., no?


Perhaps we misunderstood each other - I'm using an index built by Nutch; 
there's no substitute for that, I agree. It was just more convenient for 
me to skip all Nutch classes for _querying_ alone, because it was easier 
to control the exact final form of the Lucene query - especially if you 
want to experiment quickly with a lot of variables that are not (yet) 
parametrized through the config files. Going through Nutch you end up 
with a plain Lucene query anyway, only then you don't know exactly how 
much time was spent on translating, building filters, etc. *shrug* You 
can do it either way, I agree.

>
>>>> In several installations I use smaller values of slop (around 
>>>> 20-40). But this is motivated by better quality matches, not by 
>>>> performance, so I didn't test for this...
>>>
>>>
>>> But that's a great reason to test for it!  If lower slop can improve 
>>> result quality, then we should certainly see if it also makes 
>>> optimizations easier.
>>
>>
>> I forgot to mention this - the tests I ran already used the smaller 
>> values: the slop was set to 20.
>
>
> Are they different if the slop is Integer.MAX_VALUE?  It would be 
> really good to determine what causes results to diverge, whether it is 
> multiple terms (probably not), phrases (probably), and/or slop 
> (perhaps).  Chances are that the divergence is bad, that results are 
> adversely affected, and that we need to try to fix it.  But to do so 
> we'll need to understand it.


Agreed. I'll try to re-run the tests with queries that set a different 
slop value, or omit the phrases completely (and it's quite easy to do 
this with my approach, just use a different translated query on the 
cmd-line ;-) ).

>
>> That's another advantage of using Lucene directly in this script - 
>> you can provide any query structure on the command-line without 
>> changing the code in Nutch.
>
>
> But that just means that we should set the SLOP constant in 
> BasicQueryFilter.java from a configuration property, and permit the 
> setting of configuration properties from the command line, no?


Well, if you want to quickly experiment with radically different query 
translation, then no.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Doug Cutting wrote:
>> The graph just shows that they differ, not how much better or worse 
>> they are, since the baseline is not perfect.  When the top-10 is 50% 
>> different, are those 5 different hits markedly worse matches to your 
>> eye than the five they've displaced, or are they comparable?  That's 
>> what really matters.
> 
> Hmm. I'm not sure I agree with this. Your reasoning would be true if we 
> were changing the ranking formula. But the goal IMHO with these patches 
> is to return equally complete results, using the same ranking formula.

But we should not assume that the ranking formula is perfect.  Imagine a 
case where the high-order bits of the score are correct and the 
low-order bits are random.  Then an optimization which changes local 
orderings does not actually affect result quality.

> I specifically avoided using normalized scores, instead using the 
> absolute scores in TopDocs. And the absolute scores in both cases are 
> exactly the same, for those results that are present.
> 
> What is wrong is that some results that should be there (judging by the 
> ranking) are simply missing. So, it's about the recall, and the baseline 
> index gives the best estimate.

Yes, this optimization, by definition, hurts recall.  The only question 
is does it substantially hurt relevance at, e.g., 10 hits.  If the 
top-10 are identical then the answer is easy: no, it does not.  But if 
they differ, we can only answer this by looking at results.  Chances are 
they're worse, but how much?  Radically?  Slightly?  Noticeably?

>> What part of Nutch are you trying to avoid?  Perhaps you could try 
>> measuring your Lucene-only benchmark against a Nutch-based one.  If 
>> they don't differ markedly then you can simply use Nutch, which makes 
>> it a stronger benchmark.  If they differ, then we should figure out why.
> 
> Again, I don't see it this way. Nutch results will always be worse than 
> pure Lucene, because of the added layers. If I can't improve the 
> performance in the Lucene code (which takes > 85% of the time for every 
> query) then no matter how well optimized the Nutch code is it won't get far.

But we're mostly modifying Nutch's use of Lucene, not modifying Lucene. 
  So measuring Lucene alone won't tell you everything, and you'll keep 
having to port Nutch stuff.  If you want to, e.g., replay a large query 
log to measure average performance, then you'll need things like 
auto-filterization, n-grams, query plugins, etc., no?

>>> In several installations I use smaller values of slop (around 20-40). 
>>> But this is motivated by better quality matches, not by performance, 
>>> so I didn't test for this...
>>
>> But that's a great reason to test for it!  If lower slop can improve 
>> result quality, then we should certainly see if it also makes 
>> optimizations easier.
> 
> I forgot to mention this - the tests I ran already used the smaller 
> values: the slop was set to 20.

Are they different if the slop is Integer.MAX_VALUE?  It would be really 
good to determine what causes results to diverge, whether it is multiple 
terms (probably not), phrases (probably), and/or slop (perhaps).  Chances 
are that the divergence is bad, that results are adversely affected, and 
that we need to try to fix it.  But to do so we'll need to understand it.

> That's another advantage of using Lucene directly in this script - you 
> can provide any query structure on the command-line without changing the 
> code in Nutch.

But that just means that we should set the SLOP constant in 
BasicQueryFilter.java from a configuration property, and permit the 
setting of configuration properties from the command line, no?

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>>>  . How were the queries generated?  From a log or randomly?
>>
>>
>> Queries have been picked up manually, to test the worst performing 
>> cases from a real query log.
>
>
> So, for example, the 50% error rate might not be typical, but could be 
> worst-case.


Yes, that's true.

>
>>>  . When results differed greatly, did they look a lot worse?
>>
>>
>> Yes. E.g. see the differences for MAX_HITS=10000
>
>
> The graph just shows that they differ, not how much better or worse 
> they are, since the baseline is not perfect.  When the top-10 is 50% 
> different, are those 5 different hits markedly worse matches to your 
> eye than the five they've displaced, or are they comparable?  That's 
> what really matters.
>

Hmm. I'm not sure I agree with this. Your reasoning would be true if we 
were changing the ranking formula. But the goal IMHO with these patches 
is to return equally complete results, using the same ranking formula.

I specifically avoided using normalized scores, instead using the 
absolute scores in TopDocs. And the absolute scores in both cases are 
exactly the same, for those results that are present.

What is wrong is that some results that should be there (judging by the 
ranking) are simply missing. So, it's about the recall, and the baseline 
index gives the best estimate.

>> I actually forgot to write that I don't use any of the Nutch code. 
>> Early on I decided to eliminate this part in order to first get the 
>> raw performance from Lucene - but still using the Lucene queries 
>> corresponding to translated Nutch queries.
>
>
> What part of Nutch are you trying to avoid?  Perhaps you could try 
> measuring your Lucene-only benchmark against a Nutch-based one.  If 
> they don't differ markedly then you can simply use Nutch, which makes 
> it a stronger benchmark.  If they differ, then we should figure out why.


Again, I don't see it this way. Nutch results will always be worse than 
pure Lucene, because of the added layers. If I can't improve the 
performance in the Lucene code (which takes > 85% of the time for every 
query) then no matter how well optimized the Nutch code is it won't get far.

So, I'm reproducing the same queries that Nutch passes to Lucene, in 
order to simulate the typical load generated by Nutch, but avoiding any 
non-essential code that could skew the results. What's wrong with that? 
I think this approach makes the benchmark easier to understand and keeps 
it isolated from non-essential factors. When we reach a significant 
improvement at this level, of course we should move to using Nutch code 
to test how well it works on top.

>
>> In several installations I use smaller values of slop (around 20-40). 
>> But this is motivated by better quality matches, not by performance, 
>> so I didn't test for this...
>
>
> But that's a great reason to test for it!  If lower slop can improve 
> result quality, then we should certainly see if it also makes 
> optimizations easier.


I forgot to mention this - the tests I ran already used the smaller 
values: the slop was set to 20.

That's another advantage of using Lucene directly in this script - you 
can provide any query structure on the command-line without changing the 
code in Nutch.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
>>  . How were the queries generated?  From a log or randomly?
> 
> Queries have been picked up manually, to test the worst performing cases 
> from a real query log.

So, for example, the 50% error rate might not be typical, but could be 
worst-case.

>>  . When results differed greatly, did they look a lot worse?
> 
> Yes. E.g. see the differences for MAX_HITS=10000

The graph just shows that they differ, not how much better or worse they 
are, since the baseline is not perfect.  When the top-10 is 50% 
different, are those 5 different hits markedly worse matches to your eye 
than the five they've displaced, or are they comparable?  That's what 
really matters.

> I actually forgot to write that I don't use any of the Nutch code. 
> Early on I decided to eliminate this part in order to first get the raw 
> performance from Lucene - but still using the Lucene queries 
> corresponding to translated Nutch queries.

What part of Nutch are you trying to avoid?  Perhaps you could try 
measuring your Lucene-only benchmark against a Nutch-based one.  If they 
don't differ markedly then you can simply use Nutch, which makes it a 
stronger benchmark.  If they differ, then we should figure out why.

> In several installations I use smaller values of slop (around 20-40). 
> But this is motivated by better quality matches, not by performance, so 
> I didn't test for this...

But that's a great reason to test for it!  If lower slop can improve 
result quality, then we should certainly see if it also makes 
optimizations easier.

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> I tested it on a 5 mln index.
>
>
> Thanks, this is great data!
>
> Can you please tell a bit more about the experiments?  In particular:
>  . How were scores assigned to pages?  Link analysis?  log(number of 
> incoming links) or OPIC?


log()

>  . How were the queries generated?  From a log or randomly?


Queries have been picked up manually, to test the worst performing cases 
from a real query log.

>  . How many queries did you test with?


A couple, not many. This of course needs to be improved, but I didn't 
want to spend too much time until we polish the code.

>  . When results differed greatly, did they look a lot worse?


Yes. E.g. see the differences for MAX_HITS=10000

>
> My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.
>
>> For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
>> queries which executed in e.g. 500 ms now executed in 10-20 ms 
>> (perfRate=40). As intuition suggests, performance drops as we 
>> increase MAX_HITS, until it reaches more or less the original values 
>> (perfRate=1) for MAX_HITS=300000 (for a 5 mln doc index). After that, 
>> increasing MAX_HITS actually worsens the performance (perfRate << 1) 
>> - which can be explained by the fact that the standard HitCollector 
>> doesn't collect as many documents if they score too low.
>
>
> This doesn't make sense to me.  It should never be slower.  We're not 
> actually keeping track of any more hits, only stopping earlier.


Yes, that's puzzling... Please read below.

>
>> * Two-term Nutch queries result in complex Lucene BooleanQueries over 
>> many index fields, including also PhraseQueries. These fared much 
>> worse than single-term queries: actually, the topN values were very 
>> low until MAX_HITS was increased to large values, and then all of a 
>> sudden all topN-s flipped into the 80-90% ranges.
>
>
> It would be interesting to try altering the generated query, to see if 
> it is the phrases or simply multiple terms which cause problems.  To 
> do this, one could hack the query-basic plugin, or simply alter query 
> boost parameters.  This 


I actually forgot to write that I don't use any of the Nutch code. Early 
on I decided to eliminate this part in order to first get the raw 
performance from Lucene - but still using the Lucene queries 
corresponding to translated Nutch queries.

Please see the attached scripts. You will need to put bsh and lucene on 
your classpath, as well as compile and add the LimitingCollector classes 
(it was not possible to implement this inside bsh scripts, due to 
vagaries of class overloading in BeanShell). You need to customize the 
paths to both indexes inside the script. Then run:

    java -cp bsh.jar:lucene.jar:LimitedCollector.jar bsh.Interpreter 
test-opt.sh "+url:information" 10000

The script will iterate 10 times for a warm-up, and then report the 
timing, matching, and topN results. You will get an HTML table with 
the first 100 results.

> would help us figure out where the optimization is failing.  Suel used 
> multi-term queries, but not phrases, so we expect that the phrases are 
> causing the problem, but it would be good to see for certain.  We've 
> also never tuned Nutch's phrase matching, so it's also possible that 
> we may sometimes over-emphasize the phrase component in scores.  For 
> example, a slop of 10 might give better results and/or be more 
> amenable to this optimization.


In several installations I use smaller values of slop (around 20-40). 
But this is motivated by better quality matches, not by performance, so 
I didn't test for this...

>
>> I also noticed that the values of topN depended strongly on the 
>> document frequency of terms in the query. For a two-term query, where 
>> both terms have average document frequency, the topN values start 
>> from ~50% for low MAX_HITS. For a two-term query where one of the 
>> terms has a very high document frequency, the topN values start from 
>> 0% for low MAX_HITS. See the spreadsheet for details.
>
>
> Were these actually useful queries?  For example, I would not be 
> concerned if results differed greatly for a query like 'to be', since 
> that's not a very useful query.  Try searching for 'the the' on Google.
>

No, these were not stop-words. "High DF" is around 0.3, i.e. the term 
occurs in 30% of documents; but it's still a valid and useful term.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I tested it on a 5 mln index.

Thanks, this is great data!

Can you please tell a bit more about the experiments?  In particular:
  . How were scores assigned to pages?  Link analysis?  log(number of 
incoming links) or OPIC?
  . How were the queries generated?  From a log or randomly?
  . How many queries did you test with?
  . When results differed greatly, did they look a lot worse?

My attempt to sort a 38M page index failed with OutOfMemory.  Sigh.

> For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
> queries which executed in e.g. 500 ms now executed in 10-20 ms 
> (perfRate=40). As intuition suggests, performance drops as we increase 
> MAX_HITS, until it reaches more or less the original values (perfRate=1) 
> for MAX_HITS=300000 (for a 5 mln doc index). After that, increasing 
> MAX_HITS actually worsens the performance (perfRate << 1) - which can be 
> explained by the fact that the standard HitCollector doesn't collect as 
> many documents if they score too low.

This doesn't make sense to me.  It should never be slower.  We're not 
actually keeping track of any more hits, only stopping earlier.

> * Two-term Nutch queries result in complex Lucene BooleanQueries over 
> many index fields, including also PhraseQueries. These fared much worse 
> than single-term queries: actually, the topN values were very low until 
> MAX_HITS was increased to large values, and then all of a sudden all 
> topN-s flipped into the 80-90% ranges.

It would be interesting to try altering the generated query, to see if 
it is the phrases or simply multiple terms which cause problems.  To do 
this, one could hack the query-basic plugin, or simply alter query boost 
parameters.  This would help us figure out where the optimization is 
failing.  Suel used multi-term queries, but not phrases, so we expect 
that the phrases are causing the problem, but it would be good to see 
for certain.  We've also never tuned Nutch's phrase matching, so it's 
also possible that we may sometimes over-emphasize the phrase component 
in scores.  For example, a slop of 10 might give better results and/or 
be more amenable to this optimization.

> I also noticed that the values of topN depended strongly on the document 
> frequency of terms in the query. For a two-term query, where both terms 
> have average document frequency, the topN values start from ~50% for low 
> MAX_HITS. For a two-term query where one of the terms has a very high 
> document frequency, the topN values start from 0% for low MAX_HITS. See 
> the spreadsheet for details.

Were these actually useful queries?  For example, I would not be 
concerned if results differed greatly for a query like 'to be', since 
that's not a very useful query.  Try searching for 'the the' on Google.

Thanks!

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Ok, I just tested IndexSorter for now. It appears to work correctly, 
>> at least I get exactly the same results, with the same scores and the 
>> same explanations, if I run the same queries on the original and on 
>> the sorted index.
>
>
> Here's a more complete version, still mostly untested.  This should 
> make searches faster.  We'll see how much good the results are...
>
> This includes a patch to Lucene to make it easier to write hit 
> collectors that collect TopDocs.
>
> I'll test this on a 38M document index tomorrow.


I tested it on a 5 mln index.

The original index is considered the "baseline", i.e. it represents 
normative values for scoring and ranking. Its results are compared to 
results from the optimized index, and the scores and positions of hits 
are recorded for both. Finally, the two lists are matched, and relative 
differences in scoring and ranking are calculated.

At the end, I calculate the top10%, top50% and top100% measures, defined 
as the percentage of the top-N hits from the optimized index which match 
the top-N hits from the baseline index. Ideally, all these measures 
should be 100%, i.e. all top-N hits from the optimized index should 
match the corresponding top-N hits from the baseline index.
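
In pseudo-Java, the measure is simply (a sketch; hits matched by URL, 
since the sorted index renumbers documents):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // topN = percentage of the optimized index's top-N hits that also
    // appear among the baseline index's top-N hits.
    static double topN(String[] baselineHits, String[] optimizedHits, int n) {
      Set<String> baseline =
          new HashSet<String>(Arrays.asList(baselineHits).subList(0, n));
      int matches = 0;
      for (int i = 0; i < n; i++)
        if (baseline.contains(optimizedHits[i])) matches++;
      return 100.0 * matches / n;
    }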

One variable which greatly affects both the recall and the performance 
is the maximum number of hits considered by the TopDocCollector. In my 
tests I used values between 1,000 and 500,000 (the latter representing 
1/10th of the full index in my case).

Now, the results. I collected all test results in a spreadsheet 
(OpenDocument or PDF format), you can download it from:

    http://www.getopt.org/nutch/20051214/nutchPerf.ods
    http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
queries which executed in e.g. 500 ms now executed in 10-20 ms 
(perfRate=40). As intuition suggests, performance drops as we increase 
MAX_HITS, until it reaches more or less the original values (perfRate=1) 
for MAX_HITS=300000 (for a 5 mln doc index). After that, increasing 
MAX_HITS actually worsens the performance (perfRate << 1) - which can be 
explained by the fact that the standard HitCollector doesn't collect as 
many documents if they score too low.

* Single-term Nutch queries (i.e. which do not produce Lucene 
PhraseQueries) yield relatively good values of topN, even for relatively 
small values of MAX_HITS - however, MAX_HITS=1000 yields all topN=0%. 
The minimum useful value for my index was MAX_HITS=10000 (perfRate=30), 
and this yields quite acceptable top10=90%, but less acceptable top50 
and top100. Please see the spreadsheet for details.

* Two-term Nutch queries result in complex Lucene BooleanQueries over 
many index fields, including also PhraseQueries. These fared much worse 
than single-term queries: actually, the topN values were very low until 
MAX_HITS was increased to large values, and then all of a sudden all 
topN-s flipped into the 80-90% ranges.

I also noticed that the values of topN depended strongly on the document 
frequency of terms in the query. For a two-term query, where both terms 
have average document frequency, the topN values start from ~50% for low 
MAX_HITS. For a two-term query where one of the terms has a very high 
document frequency, the topN values start from 0% for low MAX_HITS. See 
the spreadsheet for details.

Conclusions: more work is needed... ;-)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I'll test it soon - one comment, though. Currently you use a subclass of 
> RuntimeException to stop the collecting. I think we should come up with 
> a better mechanism - throwing exceptions is too costly.

I thought about this, but I could not see a simple way to achieve it. 
And one exception thrown per query is not very expensive.  But it is bad 
style.  Sigh.

> Perhaps the 
> HitCollector.collect() method should return a boolean to signal whether 
> the searcher should continue working.

We don't really want a HitCollector in this case: we want a TopDocs.  So 
the patch I made is required: we need to extend the HitCollector that 
implements TopDocs-based searching.

Long-term, to avoid the 'throw', we'd need to also:

1. Change:
      TopDocs Searchable.search(Query, Filter, int numHits)
    to:
      TopDocs Searchable.search(Query, Filter, int numHits, maxTotalHits)

2. Add, for back-compatibility:
      TopDocs Searcher.search(Query, Filter, int numHits) {
        return search(query, filter, numHits, Integer.MAX_VALUE);
      }

3. Add a new method:
      /** Return false to stop hit processing. */
      boolean HitCollector.processHit(int doc, float score) {
        collect(doc, score);   // for back-compatibility
        return true;
      }
    Then change all calls to HitCollector.collect to instead call this,
    and deprecate HitCollector.collect.

I think that would do it.  But is it worth it?
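
For illustration, the inner search loop would then look roughly like 
this (a sketch of the proposed API; ExtendedHitCollector is a 
hypothetical name for the result of item 3 above):

    interface ExtendedHitCollector {
      /** Return false to stop hit processing. */
      boolean processHit(int doc, float score);
    }

    static void score(org.apache.lucene.search.Scorer scorer,
                      ExtendedHitCollector collector) throws java.io.IOException {
      while (scorer.next()) {
        if (!collector.processHit(scorer.doc(), scorer.score()))
          return;  // the collector asked to stop; no exception needed
      }
    }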

In the past I've frequently wanted to be able to extend TopDocs-based 
searching, so I think the Lucene patch I've constructed so far is 
generally useful.

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Ok, I just tested IndexSorter for now. It appears to work correctly, 
>> at least I get exactly the same results, with the same scores and the 
>> same explanations, if I run the same queries on the original and on 
>> the sorted index.
>
>
> Here's a more complete version, still mostly untested.  This should 
> make searches faster.  We'll see how much good the results are...
>
> This includes a patch to Lucene to make it easier to write hit 
> collectors that collect TopDocs.
>
> I'll test this on a 38M document index tomorrow.


I'll test it soon - one comment, though. Currently you use a subclass of 
RuntimeException to stop the collecting. I think we should come up with 
a better mechanism - throwing exceptions is too costly. Perhaps the 
HitCollector.collect() method should return a boolean to signal whether 
the searcher should continue working.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Ok, I just tested IndexSorter for now. It appears to work correctly, at 
> least I get exactly the same results, with the same scores and the same 
> explanations, if I run the same queries on the original and on the 
> sorted index.

Here's a more complete version, still mostly untested.  This should make 
searches faster.  We'll see how much good the results are...

This includes a patch to Lucene to make it easier to write hit 
collectors that collect TopDocs.

I'll test this on a 38M document index tomorrow.

Cheers,

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> Shouldn't this be combined with a HitCollector that collects only the 
>> first-n matches? Otherwise we still need to scan the whole posting 
>> list...
>
>
> Yes.  I was just posting the work-in-progress.


Ok, I just tested IndexSorter for now. It appears to work correctly, at 
least I get exactly the same results, with the same scores and the same 
explanations, if I run the same queries on the original and on the 
sorted index. For now, the query response time is identical as far as I 
can tell.

> We will also need to estimate the total number of matches by 
> extrapolating linearly from the maximum doc id processed.


...which should be reported by the custom HitCollector, right?

> Finally, it is probably rather slow for large indexes, whose .fdt 
> won't fit in memory.  A simple way to improve that might be to use 
> Similarity.floatToByte-encoded floats when sorting (e.g., the norm 
> from an untokenized field) so that 


Yes - for a 5 mln doc index the IndexOptimizer takes ~10 min. to 
complete, while this IndexSorter took over 1 hour...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Shouldn't this be combined with a HitCollector that collects only the 
> first-n matches? Otherwise we still need to scan the whole posting list...

Yes.  I was just posting the work-in-progress.

We will also need to estimate the total number of matches by 
extrapolating linearly from the maximum doc id processed.  Finally, it 
is probably rather slow for large indexes, whose .fdt won't fit in 
memory.  A simple way to improve that might be to use 
Similarity.floatToByte-encoded floats when sorting (e.g., the norm from 
an untokenized field) so that documents whose boosts are close are not 
re-ordered.  I'll start work on these in the morning.  (It is currently 
my middle-of-night.)
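
The extrapolation itself is trivial (a sketch, assuming matches are 
spread roughly uniformly over the doc-id space):

    // If we stopped after 'found' hits, the last of which had id
    // 'lastDoc', in an index of 'maxDoc' documents, then roughly:
    static int estimateTotalHits(int found, int lastDoc, int maxDoc) {
      return (int) ((long) found * maxDoc / (lastDoc + 1));
    }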

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Andrzej Bialecki wrote:
>
>> By all means please start, this is still near the limits of my 
>> knowledge of Lucene... ;-)
>
>
> Attached is a class which sorts a Nutch index by boost.  I have only 
> tested it on a ~100 page index, where it appears to work correctly. 
> Please tell me how it works for you.


Shouldn't this be combined with a HitCollector that collects only the 
first-n matches? Otherwise we still need to scan the whole posting list...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> By all means please start, this is still near the limits of my knowledge 
> of Lucene... ;-)

Attached is a class which sorts a Nutch index by boost.  I have only 
tested it on a ~100 page index, where it appears to work correctly. 
Please tell me how it works for you.

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> By all means please start, this is still near the limits of my knowledge 
> of Lucene... ;-)

Okay, I'll try to get something working fairly soon.

Doug

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Yes, this is why I was discouraged and stopped working on this.
>
> However I am now hopeful that sorting the entire index by page score 
> and using top-1000 might work well with Nutch queries, since page 
> score is field-independent, and I think fields cause the problems.  
> Plus, this would be a lot simpler than the cross-field summing 
> described above.
>
> I can start writing an index-sorter today, unless you are already 
> working on this.  If you have an evaluation framework, that would be 
> great.


By all means please start, this is still near the limits of my knowledge 
of Lucene... ;-)

My testing framework consists of a bunch of Beanshell scripts, and a 
test index that I know of (which I'm not at liberty to share). But I can 
prepare another index, based e.g. on the Reuters corpus, and clean up 
the scripts somewhat.

I'm interested in following this up and contributing to a usable conclusion.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> For single term queries (in Nutch - in Lucene they are rewritten to 
> complex BooleanQueries), the hit lists are nearly identical for the 
> first 10 hits, then they start to differ more and more as you progress 
> along the original hit list. This is not so surprising - after all, this 
> "optimization" operation is lossy. Still, the differences are higher 
> than those reported in the paper by Suel (though they used a different 
> algorithm to select the postings) - Suel et al. were able to achieve 98% 
> accuracy for the top-10 results, _including_ multi-term boolean queries.

A better way to prune the index might be to look at the sum of 
query-boosted scores from the content, title, url and anchor fields for 
each term.  One could process four TermEnums in parallel, one for each 
field, and include documents in the index if the sum places them in the 
top 10%.  But this is rather complex, and I am hopeful that a simpler 
method may work better.

> For multi-term Nutch queries, which are rewritten to a combination of 
> boolean queries and sloppy phrase queries, the effects are disastrous - 

Yes, this is why I was discouraged and stopped working on this.

However I am now hopeful that sorting the entire index by page score and 
using top-1000 might work well with Nutch queries, since page score is 
field-independent, and I think fields cause the problems.  Plus, this 
would be a lot simpler than the cross-field summing described above.

I can start writing an index-sorter today, unless you are already 
working on this.  If you have an evaluation framework, that would be great.

Doug

IndexOptimizer (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> The IndexOptimizer.java class in the searcher package was an old 
> attempt to create something like what Suel calls "fancy postings".  It 
> creates an index with the top 10% scoring postings.  Since documents 
> are not renumbered one can intermix postings from this with the full 
> index.  So for example, one can first try searching using this index 
> for terms that occur more than, e.g., 10k times, and use the full 
> index for rarer words.  If that does not find 1000 hits then the full 
> index must be searched.  Such an approach can be combined with using a 
> pre-sorted index.


I tested the IndexOptimizer, comparing the result lists from the 
original and the optimized index.

The trick in the original IndexOptimizer to avoid copying field data 
doesn't work anymore - it throws exceptions during segment merging. I 
"fixed it" by commenting out overriden numDocs() and maxDoc() in 
OptimizingReader.

Then, after analyzing the explanations I came to the conclusion that the 
IDFs are calculated based on the original ratios of docFreq/numDocs, so 
I needed to modify Similarity.idf() to account for the changed 
docFreq/numDocs (by FRACTION).
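
Something like this sketch (hypothetical code; FRACTION is the share of 
postings kept by IndexOptimizer, e.g. 0.1 for the top 10%):

    import org.apache.lucene.search.DefaultSimilarity;

    public class OptimizedSimilarity extends DefaultSimilarity {
      private static final float FRACTION = 0.1f;  // fraction of postings kept
      // Scale the pruned docFreq back up so term weights match the
      // ratios seen in the original, unpruned index.
      public float idf(int docFreq, int numDocs) {
        return super.idf((int) (docFreq / FRACTION), numDocs);
      }
    }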

The results, speed-wise, were very encouraging - however, after 
comparing the hit lists I discovered that they differed significantly.

For single term queries (in Nutch - in Lucene they are rewritten to 
complex BooleanQueries), the hit lists are nearly identical for the 
first 10 hits, then they start to differ more and more as you progress 
along the original hit list. This is not so surprising - after all, this 
"optimization" operation is lossy. Still, the differences are higher 
than those reported in the paper by Suel (though they used a different 
algorithm to select the postings) - Suel et al. were able to achieve 98% 
accuracy for the top-10 results, _including_ multi-term boolean queries.

For multi-term Nutch queries, which are rewritten to a combination of 
boolean queries and sloppy phrase queries, the effects are disastrous - 
I could barely manage to get some of the matching hits within the first 
300 results, and their order was completely at odds with the original 
hit list. This is probably due to the scoring of sloppy phrases - I need 
to modify the test scripts to compare the explanations from matching 
results...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Posted by Jérôme Charron <je...@gmail.com>.
> The total number of hits (approx) is 2,780,000,000. BTW, I find it
> curious that the last 3 or 6 digits always seem to be zeros ... there's
> some clever guesstimation involved here. The fact that Google Suggest is
> able to return results so quickly would support this suspicion.
>
For more information about "fake" Google counts, I suggest you take a 
look at some tests performed by Jean Véronis, a French academic:
http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dawid Weiss wrote:

>
> Hi Andrzej,
>
> This was a very interesting experiment -- thanks for sharing the 
> results with us.
>
>> The last range was the maximum in this case - Google wouldn't display 
>> any hit above 652 (which I find curious, too - because the total 
>> number of hits is, well, significantly higher - and Google claims to 
>> return up to the first 1000 results).
>
>
> I believe this may have something to do with the way Google compacts 
> URLs. My guess is that initially 1000 results are found and ranked. 
> Then pruning is performed on that, leaving just a subset of results 
> for the user to select from.
>

That was my guess, too ...

> Sorry, my initial intuition proved wrong -- there is no clear logic 
> behind the maximum limit of results you can see (unless you can find 
> some logic in the fact that I can see _more_ results when I _exclude_ 
> repeated ones from the total).


Well, trying not to sound too much like Spock... Fascinating :-), but 
the only logical conclusion is that at the user end we never deal with 
any hard results calculated directly from the hypothetical "main index"; 
we deal just with rough estimates from the "estimated" indexes. These 
change over time, and perhaps even with the group of servers that 
answered this particular query... My guess is that there could be 
different "estimated" indexes prepared for different values of the main 
boolean parameters, like filter=0...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Andrzej,

This was a very interesting experiment -- thanks for sharing the results 
with us.

> The last range was the maximum in this case - Google wouldn't display 
> any hit above 652 (which I find curious, too - because the total number 
> of hits is, well, significantly higher - and Google claims to return up 
> to the first 1000 results).

I believe this may have something to do with the way Google compacts 
URLs. My guess is that initially 1000 results are found and ranked. 
Then pruning is performed on that, leaving just a subset of results for 
the user to select from.

If you try this self-indulgent query (with filtering enabled):

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=active&start=900

You get: "Results 781 - 782 of about 61,700"

Now try disabling filtering:

http://www.google.com/search?as_q=dawid+weiss&num=10&hl=en&as_qdr=all&as_occt=any&as_dt=i&safe=images&start=900

Then you get: Results 781 - 782 of about 65,500

Hmmm... still the same number of available results, but the total 
estimate is higher.

So far I used URL parameters found on the "advanced" search page. I 
tried to "display the omitted search results", as Google suggested. 
Interestingly, this led to:

http://www.google.com/search?q=dawid+weiss&hl=en&shb=t&filter=0&start=900

"Results 541 - 549 of about 65,400 "

And that's the maximum you can get.

Sorry, my initial intuition proved wrong -- there is no clear logic 
behind the maximum limit of results you can see (unless you can find 
some logic in the fact that I can see _more_ results when I _exclude_ 
repeated ones from the total).

Dawid

Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hi,

I made an experiment with Google, to see if they use a similar approach.

I find the results to be most interesting. I selected a query which is 
guaranteed to give large result sets, but is more complicated than a 
single term query: http com.

The total number of hits (approx) is 2,780,000,000. BTW, I find it 
curious that the last 3 or 6 digits always seem to be zeros ... there's 
some clever guesstimation involved here. The fact that Google Suggest is 
able to return results so quickly would support this suspicion.

When I ran the query for the first time, the response time was 0.29 sec. 
All subsequent queries retrieving the first 10 results are in the order 
of 0.07 sec.

This is for retrieving just the first page (first 10 results). 
Retrieving results 10-20 also takes 0.08 sec, which suggests that this 
result was cached somewhere. Starting from results 20+ the response time 
increases (linearly?), although it varies wildly between requests, 
sometimes returning quicker, sometimes taking the max time - which 
suggests that I'm hitting different servers each time. Also, if I wait 
~30 sec to 1 minute, the response times are back to the values for the 
first-time run.

start   first (sec)   repeated (sec)
30      0.14      0.08-0.21
50      0.29      0.11-0.22
100     0.36      0.22-0.45
200     0.73      0.49-0.65
300     0.96      0.64-0.98
500     1.36      1.43-1.87
650     2.24      1.49-1.85

The last range was the maximum in this case - Google wouldn't display 
any hit above 652 (which I find curious, too - because the total number 
of hits is, well, significantly higher - and Google claims to return up 
to the first 1000 results).

My impressions from this exercise are perhaps not so surprising: Google 
is highly optimized for retrieving the first couple of results, and the 
more results you want to retrieve, the worse the performance. Finally, 
you won't be able to retrieve any results above a couple hundred, quite 
often fewer than the claimed 1000-results threshold.

As for the exact techniques of this optimization, we'll never know for 
sure, but it seems that something similar to what you outlined in your 
email is going on. I think it would be great to try it out.

Andrzej


Doug Cutting wrote:

> Doug Cutting wrote:
>
>> Implementing something like this for Lucene would not be too 
>> difficult. The index would need to be re-sorted by document boost: 
>> documents would be re-numbered so that highly-boosted documents had 
>> low document numbers.
>
>
> In particular, one could:
>
> 1. Create an array of int[maxDoc], with a[i] = i.
> 2. Sort the array with order(i,j) = boost(i) - boost(j);
> 3. Implement a FilterIndexReader that re-numbers using the sorted 
> array.  So, for example, the document numbers in the TermPositions 
> will be a[old.doc()].  Each term's positions will need to be read 
> entirely into memory and sorted to perform this renumbering.
>
> The IndexOptimizer.java class in the searcher package was an old 
> attempt to create something like what Suel calls "fancy postings".  It 
> creates an index with the top 10% scoring postings.  Since documents 
> are not renumbered one can intermix postings from this with the full 
> index.  So for example, one can first try searching using this index 
> for terms that occur more than, e.g., 10k times, and use the full 
> index for rarer words.  If that does not find 1000 hits then the full 
> index must be searched.  Such an approach can be combined with using a 
> pre-sorted index.
>
> I think the first thing to implement would be something like what Suel 
> calls first-1000.  Then we need to evaluate this and determine, for a 
> query log, how different the results are.
>
>> Then a HitCollector can simply stop searching once a given number of 
>> hits are found.
>>
>> Doug
>>
>>
>
>


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene performance bottlenecks

Posted by Doug Cutting <cu...@nutch.org>.
Doug Cutting wrote:
> Implementing something like this for Lucene would not be too difficult. 
> The index would need to be re-sorted by document boost: documents would 
> be re-numbered so that highly-boosted documents had low document 
> numbers.

In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(j) - boost(i), so that 
higher-boosted documents sort first;
3. Implement a FilterIndexReader that re-numbers using the sorted array. 
  So, for example, the document numbers in the TermPositions will be 
a[old.doc()].  Each term's positions will need to be read entirely into 
memory and sorted to perform this renumbering.
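
A rough, untested sketch of steps 1-2, plus the inverse mapping that the 
renumbering in step 3 actually needs (the boosts[] array is a placeholder 
for wherever the per-document boost comes from, e.g. Nutch's page score):

import java.util.Arrays;
import java.util.Comparator;

public class BoostRenumber {

  /** newToOld[newDoc] = oldDoc, highest-boosted document first. */
  public static int[] newToOld(final float[] boosts) {
    int maxDoc = boosts.length;
    Integer[] order = new Integer[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      order[i] = new Integer(i);                     // step 1: a[i] = i
    }
    Arrays.sort(order, new Comparator() {            // step 2: sort by
      public int compare(Object a, Object b) {       // descending boost, so
        float bi = boosts[((Integer) a).intValue()]; // highly-boosted docs
        float bj = boosts[((Integer) b).intValue()]; // get low new numbers
        return bi > bj ? -1 : (bi < bj ? 1 : 0);
      }
    });
    int[] newToOld = new int[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      newToOld[i] = order[i].intValue();
    }
    return newToOld;
  }

  /** The inverse permutation: step 3's FilterIndexReader would renumber
      postings with newDoc = oldToNew[old.doc()]. */
  public static int[] oldToNew(int[] newToOld) {
    int[] oldToNew = new int[newToOld.length];
    for (int i = 0; i < newToOld.length; i++) {
      oldToNew[newToOld[i]] = i;
    }
    return oldToNew;
  }
}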

The IndexOptimizer.java class in the searcher package was an old attempt 
to create something like what Suel calls "fancy postings".  It creates 
an index with the top 10% scoring postings.  Since documents are not 
renumbered one can intermix postings from this with the full index.  So 
for example, one can first try searching using this index for terms that 
occur more than, e.g., 10k times, and use the full index for rarer 
words.  If that does not find 1000 hits then the full index must be 
searched.  Such an approach can be combined with using a pre-sorted index.
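
In code, the fallback could be as simple as the following (class and 
variable names are hypothetical; this glosses over the per-term routing 
described above and just retries the whole query against the full index 
when the top-postings index yields too few hits):

import java.io.IOException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class TieredSearch {
  /** topSearcher searches the IndexOptimizer-style top-postings index,
      fullSearcher the complete index. */
  public static Hits search(Searcher topSearcher, Searcher fullSearcher,
                            Query query) throws IOException {
    Hits hits = topSearcher.search(query);   // cheap first pass
    if (hits.length() >= 1000) {             // enough hits: we're done
      return hits;
    }
    return fullSearcher.search(query);       // otherwise search everything
  }
}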

I think the first thing to implement would be something like what Suel 
calls first-1000.  Then we need to evaluate this and determine, for a 
query log, how different the results are.

> Then a HitCollector can simply stop searching once a given 
> number of hits are found.
> 
> Doug
> 
> 

Re: Lucene performance bottlenecks

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when 
> for any query the response time is well below 1 second. Otherwise the 
> service seems sluggish. Response times over 3 seconds are normally not 
> acceptable.

It depends.  Clearly an average response time of less than 1 second is 
better than an average response time of 3 seconds.  There is no 
argument.  That is a more useful search engine.  But a search engine 
with a 3-second average response time is still much better than no 
search engine at all.  If an institution cannot afford to guarantee a 
1-second average response time, should it then not run a search engine?  For 
low-traffic, non-commercial search engines, sluggishness is not a fatal 
fault.

> There is a total of 8,435,793 pages in that index. Here's a short list 
> of queries, the number of matching pages, and the average time (I made 
> just a couple of tests, no stress-loading ;-) )
> 
> * hurricane: 1,273,805 pages, 1.75 seconds.
> * katrina: 1,267,240 pages, 1.76 seconds
> * gov: 979,820 pages, 1.01 seconds

These are some of the slowest terms in this index.

> * hurricane katrina: 773,001 pages, 3.5 seconds (!)

This is not a very interesting query for this collection...

> * "hurricane katrina": 600,867 pages, 1.35 seconds
> * disaster relief: 205,066 pages, 1.12 seconds
> * "disaster relief": 140,007 pages, 0.42 seconds
> * hurricane katrina disaster relief: 129,353 pages, 1.99 seconds
> * "hurricane katrina disaster relief": 2,006 pages, 0.705 seconds
> * xys: 227 pages, 0.01 seconds
> * xyz: 3,497 pages,  0.005 seconds

> What I found out is that "usable" depends a lot on how you test it and 
> what is your minimum expectation. There are some high-frequency terms 
> (and by this I mean terms with frequency around 25%) that will 
> consistently cause a dramatic slowdown. Multi-term queries, because of 
> the way Nutch expands them into sloppy phrases, may take even more time, 
> so even for such relatively small index (from the POV of the whole 
> Internet!) the response time may drag into several seconds (try "com").

How often do you search for "com"?

> Response times over several seconds would mean that users 
> would say goodbye and never return... ;-)

So, tell me, where will these users then search for archived web content 
related to hurricane katrina?  There is no other option.  If this were a 
competitive commercial offering, then some sluggishness would indeed be 
unacceptable, and ~10M pages in a single index might be too many on 
today's processors.  But in a non-profit unique offering, I don't think 
this is unacceptable.  Not optimal, but workable.  Should the archive 
refuse to make this content searchable until they have faster or more 
machines, or until Nutch is faster?  I don't think so.

> If 10 mln docs is too much for a single server to meet such a 
> performance target, then this explodes the total number of servers 
> required to handle Internet-wide collections of billions of pages...
> 
> So, I think it's time to re-think the query structure and scoring 
> mechanisms, in order to simplify the Lucene queries generated by Nutch - 
> or to do some other tricks...

I think "other tricks" will be more fruitful.  Lucene is pretty well 
optimized, and I don't think qualitative improvements can be had by 
simplifying the queries without substantially reducing their effectiveness.

The trick that I think would be most fruitful is something like what 
Torsten Suel describes in his paper titled "Optimized Query Execution in 
Large Search Engines with Global Page Ordering".

http://cis.poly.edu/suel/papers/order.pdf
http://cis.poly.edu/suel/talks/order-vldb.ppt

I believe all of the major search engines implement something like this, 
where heuristics are used to avoid searching the complete index.  (We 
certainly did so at Excite.)  The results are no longer guaranteed to 
always be the absolute highest-scoring, but in most cases are nearly 
identical.

Implementing something like this for Lucene would not be too difficult. 
  The index would need to be re-sorted by document boost: documents 
would be re-numbered so that highly-boosted documents had low document 
numbers.  Then a HitCollector can simply stop searching once a given 
number of hits are found.
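
As a sketch of that collector (HitCollector itself offers no way to stop 
a search, so this uses the common trick of aborting with an unchecked 
exception; the class names are made up):

import org.apache.lucene.search.HitCollector;

public class FirstNCollector extends HitCollector {

  /** Thrown to abort the search once enough hits have been gathered. */
  public static class DoneException extends RuntimeException {}

  private final int[] docs;
  private final float[] scores;
  private int count = 0;

  public FirstNCollector(int maxHits) {
    docs = new int[maxHits];
    scores = new float[maxHits];
  }

  public void collect(int doc, float score) {
    // With the index pre-sorted by boost, the first documents seen are
    // (approximately) the highest-scoring ones.
    docs[count] = doc;
    scores[count] = score;
    if (++count >= docs.length) {
      throw new DoneException();            // stop the scorer early
    }
  }

  public int hitCount() { return count; }
}

The caller would wrap searcher.search(query, collector) in a try/catch 
for DoneException and then read off the collected hits.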

Doug



Re: Lucene performance bottlenecks

Posted by Andrzej Bialecki <ab...@getopt.org>.
Piotr Kosiorowski wrote:

>Hi,
>I started to think about implementing a special kind of Lucene Query (if I
>remember correctly I would have to write my own Scorer and probably a few
>other classes) optimized for Nutch some time ago. I assumed that with a
>specialized query I would be able to avoid accessing some of the Lucene
>index structures multiple times, since the same term appears many times in
>the query generated by Nutch for multi-token queries. I am not a Lucene
>expert, but maybe it is worth checking whether it might give some
>performance boost. Does anyone have any ideas on why it might or might not
>help?
>  
>

That's a very good comment. Looking at the profile traces I can see that 
a lot of time is spent just juggling the sub-query scorers inside the 
BooleanScorer, and handling the complex query structure; if this part 
could be optimized by the use of a special scorer, it could be a big win.
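
To make that concrete, one possible (hypothetical, untested) direction is 
to read each distinct term's postings once into memory and let every 
sub-scorer that mentions the term walk the cached copy instead of 
re-reading the index (this sketch ignores positions, which the sloppy 
phrase clauses would also need):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class CachedPostings {
  public final int[] docs;    // matching document numbers
  public final int[] freqs;   // term frequency in each document
  public final int size;      // number of postings actually read

  private CachedPostings(int[] docs, int[] freqs, int size) {
    this.docs = docs; this.freqs = freqs; this.size = size;
  }

  /** Read all postings for term once; sub-scorers share the result. */
  public static CachedPostings read(IndexReader reader, Term term)
      throws IOException {
    int[] docs = new int[32];
    int[] freqs = new int[32];
    int n = 0;
    TermDocs td = reader.termDocs(term);
    try {
      while (td.next()) {
        if (n == docs.length) {              // grow the buffers
          int[] d = new int[n * 2];
          int[] f = new int[n * 2];
          System.arraycopy(docs, 0, d, 0, n);
          System.arraycopy(freqs, 0, f, 0, n);
          docs = d; freqs = f;
        }
        docs[n] = td.doc();
        freqs[n] = td.freq();
        n++;
      }
    } finally {
      td.close();
    }
    return new CachedPostings(docs, freqs, n);
  }
}

For the very high-frequency terms discussed earlier in this thread such 
buffering is obviously memory-hungry, so a real implementation would 
probably cap it or combine it with the index-pruning ideas above.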

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Lucene performance bottlenecks

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hi,
I started to think about implementing a special kind of Lucene Query (if I
remember correctly I would have to write my own Scorer and probably a few
other classes) optimized for Nutch some time ago. I assumed that with a
specialized query I would be able to avoid accessing some of the Lucene
index structures multiple times, since the same term appears many times in
the query generated by Nutch for multi-token queries. I am not a Lucene
expert, but maybe it is worth checking whether it might give some
performance boost. Does anyone have any ideas on why it might or might not
help?
Regards,
Piotr